1 Introduction

Graphs are useful data structures for capturing relationships between entities in various complex interconnected systems, such as social relationships [1], protein interactions [2], commodity co-purchasing [3], and co-citations [4]. Many fundamental tasks on graphs involve making predictions over nodes, such as predicting labels for unlabeled nodes according to the graph structure and node attributes. In practical scenarios, the availability of labeled nodes is often limited due to the resource-intensive and time-consuming nature of the annotation process [5, 6]. Consequently, the scarcity of labeled examples renders the task of classifying nodes in a semi-supervised manner both challenging and crucial.

In recent years, deep graph neural networks [7,8,9,10] have garnered considerable attention and witnessed notable advancements in semi-supervised node classification. Among them, graph convolutional networks (GCNs) have emerged as a prominent approach. Early GCNs, such as GCN [11] and the hierarchical GCN model [12], are based on the graph convolution process, which propagates and aggregates feature information between adjacent nodes. These models learn node embeddings from a single-view perspective only, often adopting a shallow architecture to mitigate the over-smoothing problem [13,14,15] that may arise when processing node feature information. These characteristics mean that most single-view GCNs capture only a restricted amount of information, which is unfavorable for learning discriminative node embeddings and devising effective classifiers, particularly when labeled nodes are scarce in practical tasks.

Inspired by the recent success of contrastive learning [16,17,18] in computer vision, numerous multi-view GCNs have been specifically designed to capture node feature information from diverse perspectives. There are two primary approaches for generating node embeddings from multiple views: one combines different kinds of information from the same graph, as in CG\(^3\) [19] and VCHN [20]; the other contrasts the information captured from one graph and its corresponding augmented graph, as in MVGRL [21] and CSGNN [22]. In this way, information from different views can complement each other and provide valuable hints for improving node embedding and classification.

Despite the remarkable advancements achieved by single-view and multi-view GCNs in semi-supervised node classification, it remains a daunting challenge to effectively classify nodes with only a limited number of labeled examples (lack of supervision). This is mainly attributed to the following two problems: (1) Graph convolutions focus on information propagation from neighboring nodes to the central node based on the graph topology, so that the similar features, or commonality, shared by connected nodes are learned. However, this may result in a loss of node individuality, the acquisition of over-smoothed features, and the failure to distinguish nodes from different classes [13]. Such outcomes are unfavorable when aiming to learn a discriminative classifier. (2) The class label information of nodes is scarce and concise but plays a significant role in supervising model learning. The relations between nodes and classes should be fully exploited to mine valuable supervised signals and optimize the processes of node embedding and classification. However, most GCNs mine only limited useful supervised signals from these relations [19, 20].

In this paper, we develop an Individuality-enhanced and Multi-granularity Consistency-preserving graph neural Network (IMCN) for semi-supervised node classification with scarce labeled data. Specifically, IMCN integrates a simple two-layer MLP as a supplementary encoder to amend the individuality of nodes damaged by graph convolutions. Then, IMCN enriches supervised signals by taking full advantage of the multi-granularity relations among nodes and classes. The main contributions are summarized as follows:

1) An individuality-enhanced and multi-granularity consistency-preserving graph neural Network is built for semi-supervised node classification, which can maintain the individuality and commonality of nodes simultaneously during the feature extraction process. The proposed method is highly effective, particularly when there are only a few labeled nodes available for model training.

2) Three consistency constraints at different granularities are designed to enrich the supervised information for model learning: a fine-grained one at the node level via an improved semi-supervised contrastive loss; a coarse-grained one at the semantic class level by aligning the prototypes of the same class learned from different encoders; and a middle-grained one at the node-to-class level by enforcing the identity of the node-to-class relational distributions learned from the two encoders.

3) Extensive experiments on six real-world networks from different fields verify that IMCN significantly outperforms the comparison methods for semi-supervised node classification with few labeled nodes. In particular, on the three public benchmark datasets Cora, CiteSeer, and PubMed, the classification accuracies of IMCN are more than 2.5% higher than those of the baseline methods when only two or three labeled nodes per class are available for model training.

The remainder of the paper is structured as follows. Section 2 introduces some related work. Section 3 presents the details of the proposed model. Then, the experimental setup and results are introduced and analyzed in Section 4 and Section 5, respectively. Finally, Section 6 provides the main conclusions of this paper and gives ideas for further study.

2 Related work

In this section, previous work on GCNs for semi-supervised node classification is briefly reviewed, grouped by whether the methods capture abundant information from different aspects.

Table 1 Differences among H-GCN, CG\(^3\), VCHN, and the proposed IMCN method

2.1 Single-view GCNs

Single-view GCNs usually learn node embeddings for classification by propagating and aggregating feature information between adjacent nodes in the graph from one aspect only. The classical and representative model GCN [11], inspired primarily by convolutional operations on images, learns low-dimensional node embeddings through propagation and aggregation of nodes' and their neighbors' features. The GraphSAGE (SAmple and aggreGatE) model [23], a general inductive framework, generates node embeddings by sampling and aggregating features from a node's local neighborhood and can efficiently produce embeddings for previously unseen data. The Graph ATtention network (GAT) [24] employs an attention mechanism to modify the traditional propagation and aggregation operations between a node and its neighbors in GCN. The Simple Graph Convolution network (SGC) [25] reduces the excess complexity of GCN by successively removing nonlinearities and collapsing weight matrices between consecutive layers. The Hierarchical Graph Convolution Network (H-GCN) [12] enlarges the receptive field of the graph convolutional process in GCN by fusing nodes with similar structures into super-nodes. The simplified multi-layer graph convolutional network with dropout [6] combines SGC with dropout regularization and extends the shallow GCN to a multi-layer model that extracts information from higher-order neighbors; it reduces the redundant computation and over-fitting of the multi-layer GCN, keeping the model simple and efficient.

Despite the noticeable achievements of these single-view models, the main concern of lacking supervised information in semi-supervised node classification is still not well handled. Most single-view GCNs are shallow and consider only one aspect, so they cannot obtain adequate information for effective classification when very few nodes are labeled.

2.2 Multi-view GCNs

Different from single-view GCNs, multi-view methods are specifically designed to capture abundant information from different aspects to improve learning and classification. In recent years, many multi-view GCNs have been proposed, which can be grouped into the following two classes.

One captures and combines feature information from two views of the same graph. The Contrastive GCN with Graph Generation (CG\(^3\)) [19] integrates H-GCN and a two-layer GCN to learn complementary information from local and global views of nodes and imposes the designed node-level contrastive and graph-level generative constraints on the embeddings learned by the above two encoders. The View-Consistent Heterogeneous Network (VCHN) [20] combines the classical methods GCN and GAT to learn node embeddings from spectral and spatial views and applies constraints on the predictions between two views to promote the supervision from one to the other.

The other captures feature information from one graph and its corresponding augmented graph and contrasts them to provide extra useful information for learning. The Deep Graph Infomax model (DGI) [26] learns patch representations and the corresponding high-level summaries of graphs and their corrupted counterparts with a GCN, and then maximizes the mutual information between them. Similar to DGI, the contrastive Multi-View Graph Representation Learning model (MVGRL) [21] learns node representations for the graph and its corrupted counterpart with two different GNNs and a shared MLP and generates the corresponding graph representations from them with shared pooling and MLP layers. Then, contrastive constraints between node and graph representations are designed as important parts of the learning objective. The deep GRAph Contrastive rEpresentation learning model (GRACE) [27] first generates two correlated graph views by randomly removing edges and masking node features, and then maximizes the agreement between node embeddings in these two views based on the idea of contrastive learning. The Contrastive Semi-supervised learning model based on GNN (CSGNN) [22] employs a two-layer GCN as a teacher encoder to learn node representations for one graph and its corrupted graph, and then contrasts the latent vectors of nodes, edges, and labels between these two views to improve predictions. In the final stage, the predictions are distilled into the downstream student module.

The above two branches of multi-view GCNs can learn complementary information to boost the discrimination of node embeddings and the classification accuracy. However, these methods usually rely on the graph convolutional process, which forces the encoders to focus on the commonality of adjacent nodes and damages some of their individuality. In addition, these models do not take full advantage of the complex but valuable relation information among nodes and classes.

The methods H-GCN, CG\(^3\), and VCHN are closely related to the method proposed in this paper. The constraints and mechanism used in these four methods are listed in Table 1.

Fig. 1 Framework of the proposed IMCN model: feature extraction in Stage 1; multi-granularity consistency constraints in Stage 2; and feature fusion and node classification in Stage 3

The following are the differences among the proposed IMCN method, H-GCN, CG\(^3\), and VCHN. Compared with the multi-view methods CG\(^3\) and VCHN, a GCN-based encoder is coupled with an MLP-based encoder in the proposed IMCN method, which can enhance the node individuality for learning discriminative node embedding vectors. For the node-level constraints in IMCN and CG\(^3\), the calculation of the former is much simpler than that of the latter by reducing repeated node-pair contrasts. In addition, IMCN takes full advantage of the complex relations among nodes and classes from the views of node-to-class distribution and class centroid alignment which are ignored by the other three methods.

3 IMCN model

This section first introduces the framework of the proposed Individuality-enhanced and Multi-granularity Consistency-preserving graph neural Network (IMCN) for semi-supervised node classification. Then, the critical components of IMCN are described in detail.

3.1 Framework overview

We design a novel graph neural network (as shown in Fig. 1) for semi-supervised node classification, which aims to capture both the individual- and common-feature information of nodes and to preserve multi-granularity consistency between them, so as to learn discriminative node representations for effective classification. The model learning and classification process of IMCN mainly includes the following three stages:

1) Feature extraction The individual- and common-feature information for nodes is extracted by two different encoders according to the graph topology and original node features.

2) Multi-granularity consistency constraints The multi-granularity consistency of the feature information learned by the two encoders is preserved according to the relations among nodes and classes.

3) Feature fusion and node classification A trade-off is made between the constrained individuality and commonality of nodes and then a multi-objective loss function is established to obtain the optimized model for node classification.

3.2 Feature extraction

Different views contain quite different information describing the same object, which can provide complementary cues to improve model learning. Multi-view learning has grown in popularity as a result of this idea. Meanwhile, contrastive learning has received intensive research attention in recent years, showing that contrasting congruent and incongruent views of objects can help algorithms learn expressive representations [28,29,30,31,32]. Inspired by these ideas, two different views are established for a graph and applied to learn discriminative node representations for classification.

Although some augmentation strategies have been proposed to generate related graphs with different views, such as node dropping and edge perturbation in [21, 22, 33], they may destroy the original graph topology and degrade the performance of graph convolutional networks. Unlike these approaches, in this paper, the node itself is regarded as a local view of the graph and the node together with its adjacent neighbor structure as a global view. These two views are obviously different from those of node2vec [34], which takes the width and depth of random walks, controlled by two parameters, as the local and global views.

From the global structure view, many public GCN-based models can be used to capture the common-feature information among adjacent nodes. Here, the effective GCN-based model H-GCN [12] is adopted as the global encoder in the proposed IMCN model. H-GCN aggregates nodes that have equivalent or similar structures into hyper-nodes for graph convolution and then refines the coarsened graphs back to the original graph to restore the representation of each node. Therefore, the receptive field of each node is enlarged, and more global and common-feature information of nodes can be comprehensively captured. The node feature matrix \(\textbf{X}\in {\mathbb R}^{m\times n}\) and adjacency matrix \(\textbf{A}\in {\mathbb R}^{m\times m}\) are input into the H-GCN encoder to generate low-dimensional global node representations \(\textbf{H}^{global}\in {\mathbb R}^{m\times c}\) as follows:

$$\begin{aligned} \textbf{H}^{global} = \phi (\textbf{A},\textbf{X}), \end{aligned}$$
(1)

where m, n, and c denote the number of nodes, node features, and classes, respectively (c is also the dimension of the node embeddings); and \(\phi (\cdot )\) denotes the processes of generating coarse graphs, graph convolution, and refining the coarsened graphs in H-GCN.

However, these GCN-based encoders focus excessively on the commonality of linked nodes and may lose the individuality of nodes during information propagation. This problem also exists in the local encoder designed with a two-layer GCN in [19]. In practice, the category of a node is mainly determined by its individual-feature information. Therefore, to compensate for the individual-feature information damaged by the global GCN-based encoder, the node itself is regarded as a local view and its individual-feature information is extracted by a simple two-layer MLP encoder with \(\textbf{X}\) as the only input. The low-dimensional local node representations \(\mathbf{{H}}^{local} \in {\mathbb R}^{m\times c}\) are obtained by IMCN as follows:

$$\begin{aligned} {\mathbf{{H}}^{local}} = \sigma (\mathbf{{X}}{\mathbf{{W}}^{(0)}}){\mathbf{{W}}^{(1)}}, \end{aligned}$$
(2)

where \(\textbf{W}^{(i)}\) and \(\sigma (\cdot )\) denote the trainable weight matrices and the non-linear ReLU activation function [35], with \(\textbf{W}^{(0)}\in {\mathbb R}^{n\times d}\) and \(\textbf{W}^{(1)}\in {\mathbb R}^{d\times c}\), where d is the feature dimension of the hidden layer. Since \(\textbf{H}^{local}\) is computed without using the structural information of the graph, the label information can be propagated effectively without being limited by the distance between nodes.

To keep the feature information learned by the two encoders in the same metric space, \(\mathbf{{H}}^{local}\) and \(\mathbf{{H}}^{global}\) are normalized by the \(L_2\)-norm in the column direction before imposing the following multi-granularity consistency constraints and entering the classification stage.
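To make the feature-extraction stage concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of the two-layer MLP local encoder in (2) and the column-wise \(L_2\) normalization; `global_encoder` stands for any GCN-style encoder such as H-GCN in (1), and all names are illustrative.

```python
import torch
import torch.nn as nn

class LocalMLPEncoder(nn.Module):
    """Two-layer MLP local encoder of Eq. (2): H_local = ReLU(X W0) W1."""
    def __init__(self, n_feats: int, d_hidden: int, n_classes: int):
        super().__init__()
        self.w0 = nn.Linear(n_feats, d_hidden, bias=False)    # W^(0) in R^{n x d}
        self.w1 = nn.Linear(d_hidden, n_classes, bias=False)  # W^(1) in R^{d x c}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No adjacency matrix is used, so the node individuality is preserved.
        return self.w1(torch.relu(self.w0(x)))

def column_l2_normalize(h: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """L2-normalize the embedding matrix along the column direction (Sect. 3.2)."""
    return h / (h.norm(p=2, dim=0, keepdim=True) + eps)

# Usage sketch (shapes follow the paper: X is m x n, A is m x m):
# h_global = column_l2_normalize(global_encoder(adj, x))        # Eq. (1), e.g. H-GCN
# h_local  = column_l2_normalize(LocalMLPEncoder(n, d, c)(x))   # Eq. (2)
```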

3.3 Multi-granularity consistency constraints

For the individuality and commonality of nodes captured separately through the above feature extraction processes, it is reasonable to preserve the consistency between them to optimize the encoding process. Inspired by human cognition and intelligence, data are analyzed at different granularities in IMCN. This strategy leads the model to analyze data more comprehensively, utilize data more efficiently, and make more accurate decisions. The following parts introduce in detail the designed multi-granularity consistency constraints based on the relations among nodes and classes.

3.3.1 Node-level consistency constraint

Data tend to be analyzed intuitively at the node (sample) level, which can force the model to focus on the features of representative samples and to generalize well. At this level, the common information shared by the local and global representations of one node is described as fine-grained node-level consistency.

In the proposed IMCN method, the vector distance between the local and global representations of a node is used to measure this fine-grained consistency. In detail, the constraint is defined with an unsupervised and a supervised part as follows.

On the one hand, in order to effectively utilize the abundant unlabeled information, an unsupervised node-level loss is defined to maintain the consistency between the local and global representations of the same node:

$$\begin{aligned} L_{node}^u = - \log \frac{{\sum \nolimits _{i = 1}^m {{e^{sim(\mathbf{{h}}_i^{local},\mathbf{{h}}_i^{global})}}} }}{{\sum \nolimits _{j,k = 1}^m {{e^{sim(\mathbf{{h}}_j^{local},\mathbf{{h}}_k^{global})}}} }}, \end{aligned}$$
(3)

where \(\mathbf{{h}}_i^{local}\) is the i-th row vector in \(\mathbf{{H}}^{local}\), and \(sim(\mathbf{{a}},\mathbf{{b}})\) is the cosine similarity between \(\textbf{a}\) and \(\textbf{b}\): \(sim(\mathbf{{a}},\mathbf{{b}}) = \frac{{\mathbf{{a}} \cdot \mathbf{{b}}}}{{|\mathbf{{a}} |\times |\mathbf{{b}} |}}\). By minimizing this loss, the representations of the same node from two views are expected to be similar, while those of different nodes are expected to be away from each other.

On the other hand, labeled nodes are scarce but can provide valuable semantic information for learning expressive node embeddings for easy classification. The consistency between local and global representations of labeled nodes belonging to the same class is maintained by a designed supervised loss as follows:

$$\begin{aligned} L_{node}^s = - \log \frac{{\sum \nolimits _{i,j = 1;{\textbf{y}_i} = {\textbf{y}_j}}^m {{e^{sim(\mathbf{{h}}_i^{local},\mathbf{{h}}_j^{global})}}} }}{{\sum \nolimits _{k,q = 1}^m {{e^{sim(\mathbf{{h}}_k^{local},\mathbf{{h}}_q^{global})}}} }}, \end{aligned}$$
(4)

where \(\textbf{y}_i\) is the one-hot coded class vector of the i-th node and \(\textbf{y}_i \in {\mathbb R}^{1\times c}\). Therefore, each labeled node from one view is contrasted with the labeled nodes belonging to the same class from the other view.

Note that in the above two loss terms, we expect the embeddings of the same node, or of nodes from the same class, to be the most similar in the joint similarity distribution over all node pairs, instead of the marginal distribution, as shown in Fig. 2. This is more reasonable and avoids the following time-consuming duplicated calculations in CG\(^3\) [19]: (1) reusing negative nodes (the blue shaded and checked ones shown in Fig. 2(a) and (c)) in contrastive learning; and (2) repeating the calculation of the inner products between the row vectors of \(\textbf{H}^{local}\) and \(\textbf{H}^{global}\) in each loss term.

Fig. 2 Differences between the proposed IMCN method and CG\(^3\) [19] on calculating the unsupervised and supervised node-level losses

Finally, the local-global consistency at the fine-grained node level is maintained by a node-wise regularization term defined with both the unsupervised and supervised parts as follows:

$$\begin{aligned} {L_{node}} = L_{node}^s + L_{node}^u. \end{aligned}$$
(5)

Therefore, the learning process of IMCN for local and global node representations can complement and promote each other based on node features and semantic information.
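As an illustration of how (3)-(5) can be computed with a single pairwise similarity matrix (thereby avoiding the duplicated node-pair contrasts mentioned above), here is a minimal PyTorch sketch; it assumes integer class labels and a boolean mask of labeled nodes, and it is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def node_level_loss(h_local, h_global, labels, labeled_mask):
    """Node-level consistency L_node = L_node^s + L_node^u (Eqs. (3)-(5))."""
    # Exponentiated cosine similarities between every local/global node pair.
    zl = F.normalize(h_local, dim=1)
    zg = F.normalize(h_global, dim=1)
    sim = torch.exp(zl @ zg.t())     # (m, m); entry (j, k) = e^{sim(h_j^local, h_k^global)}
    denom = sim.sum()                # shared joint normalization over all node pairs

    # Unsupervised term (Eq. (3)): the same node seen from the two views.
    loss_u = -torch.log(sim.diagonal().sum() / denom)

    # Supervised term (Eq. (4)): labeled node pairs that share a class.
    lm = labeled_mask
    same_class = labels[lm].unsqueeze(1) == labels[lm].unsqueeze(0)
    loss_s = -torch.log(sim[lm][:, lm][same_class].sum() / denom)

    return loss_s + loss_u           # Eq. (5)
```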

3.3.2 Class-level consistency constraint

The performance of a model tends to be biased when it is trained only at the sample level. The reason is that different samples can share some common features and belong to the same class, and the model needs to distinguish between samples of different classes. Despite the small number of labeled nodes, their semantic category information is an important supplement for feature embedding. This is not taken into consideration in [19]. From the perspective of semantic category, the common information between the local and global views of nodes is referred to as coarse-grained class-level consistency.

Following, but differing from, the idea in [36], prototypes for each class are generated from the local and global views using the learned embeddings of the labeled nodes, and the distance between them is then minimized. The following constraint is designed:

$$\begin{aligned} {L_{class}} = {\frac{1}{c}\sum \limits _{i = 1}^c {\left\| {\mathbf{{c}}_i^{local} - \mathbf{{c}}_i^{global}} \right\| _2^2} }, \end{aligned}$$
(6)

where \(\textbf{c}_i^{local} \in {\mathbb R}^c\) and \(\textbf{c}_i^{global} \in {\mathbb R}^c\) are the prototypes of the i-th class calculated by average aggregation of the learned local and global embeddings of the labeled nodes belonging to this class respectively, \(\left\| \cdot \right\| _2\) is the \(L_2\)-norm operator, and \(L_{class}\) is the mean-squared Euclidean distance of the corresponding class prototypes. Note that it is different from the magnet loss [37] which uses the k-means method to compute cluster centers for each class.

The representations of the class prototypes are not stable during model learning and may forget valuable information learned earlier. Therefore, in the t-th iteration, the class prototypes \(\textbf{c}_i^{local}\) and \(\textbf{c}_i^{global}\) are computed as described above and then combined with the prototype representations obtained after the previous iteration, which updates the class prototypes and suppresses the instability:

$$\begin{aligned} \textbf{c}_i^{local(t)}= & {} (1 - \mu ) \mathbf{{c}}_i^{local(t - 1)} + \mu \mathbf{{c}}_i^{local},\nonumber \\ \textbf{c}_i^{global(t)}= & {} (1 - \mu ) \mathbf{{c}}_i^{global(t - 1)} +\mu \mathbf{{c}}_i^{global}, \end{aligned}$$
(7)

where \(\mu \) is the balance weight for updating the class prototypes in the t-th iteration based on the prototype representation after \({t-1}\) iterations, and \(\mu \in [0,1)\).
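A small sketch of the class-level constraint, under the same assumptions as above (integer labels, boolean labeled mask); the prototype averaging and momentum update follow (6) and (7), and the helper names are illustrative.

```python
import torch

def class_prototypes(h, labels, labeled_mask, n_classes):
    """Prototype of each class: mean embedding of its labeled nodes."""
    protos = torch.zeros(n_classes, h.size(1), device=h.device)
    for c in range(n_classes):
        idx = labeled_mask & (labels == c)
        if idx.any():
            protos[c] = h[idx].mean(dim=0)
    return protos

def update_prototypes(protos_prev, protos_new, mu=0.3):
    """Momentum update of Eq. (7): c^(t) = (1 - mu) * c^(t-1) + mu * c."""
    return (1.0 - mu) * protos_prev + mu * protos_new

def class_level_loss(protos_local, protos_global):
    """Class-level consistency of Eq. (6): mean squared distance of aligned prototypes."""
    return ((protos_local - protos_global) ** 2).sum(dim=1).mean()
```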

3.3.3 Consistency constraint at the node-to-class level

Assume that the distribution around each prototype is isotropic Gaussian and that the distributions around the same class in the local and global views should be similar. Therefore, in addition to the consistencies at node and class levels, there must be some indispensable consistent information in the node-to-class relationship between local and global views. This is also not taken into consideration in [19].

To make the best use of unlabeled nodes, \(sim(\textbf{h}_i, \textbf{c}_j)\) is used to calculate the similarities between each node embedding and the obtained class prototypes, and then node-to-class relational distributions are generated for unlabeled nodes in local and global views according to the following expressions:

$$\begin{aligned} p_{ij}^{local}= & {} \frac{{{e^{sim(\mathbf{{h}}_i^{local},\mathbf{{c}}_j^{local})/\tau }}}}{{\sum \nolimits _{k = 1}^c {{e^{sim(\mathbf{{h}}_i^{local},\mathbf{{c}}_k^{local})/\tau }}} }},\nonumber \\ p_{ij}^{global}= & {} \frac{{{e^{sim(\mathbf{{h}}_i^{global},\mathbf{{c}}_j^{global})/\tau }}}}{{\sum \nolimits _{k = 1}^c {{e^{sim(\mathbf{{h}}_i^{global},\mathbf{{c}}_k^{global})/\tau }}}}}, \end{aligned}$$
(8)

where \(\tau > 0\) is a temperature hyper-parameter denoting the concentration of node embeddings around class prototypes, and a smaller \(\tau \) indicates a larger concentration.

Then, the distribution of the relation of one node to all classes can be represented as \(\textbf{p}_i = [p_{i1},p_{i2},...,p_{ic}]\). The node-to-class relational consistency between \(\textbf{p}_i^{local}\) and \(\textbf{p}_i^{global}\) is kept by minimizing the Kullback-Leibler divergence [38] between them as follows:

$$\begin{aligned} {L_{node2class}}= & {} \sum \limits _{i=1}^{m-l} {{D_{KL}}(\mathbf{{p}}_i^{local}\parallel \mathbf{{p}}_i^{global})}\nonumber \\= & {} \sum \limits _{i = 1}^{m - l} {g(\mathbf{{p}}_i^{local},\mathbf{{p}}_i^{global}) - g(\mathbf{{p}}_i^{local})},\nonumber \\ g(\mathbf{{p}}_i^{local},\mathbf{{p}}_i^{global})= & {} - \sum \limits _{j = 1}^c softmax (\mathbf{{p}}_i^{local})\nonumber \\{} & {} \times \ln (softmax (\mathbf{{p}}_i^{global})), \end{aligned}$$
(9)

where l is the number of labeled nodes (so m - l is the number of unlabeled nodes), \(g(\mathbf{{p}}_i^{local})\) is omitted in the implementation, \(softmax(\cdot )\) is the softmax function, and \(g(\mathbf{{p}}_i^{local},\mathbf{{p}}_i^{global})\) is implemented by the cross-entropy function according to [39]. In the proposed IMCN model, \(L_{node2class}\) is regarded as a middle-grained consistency constraint.
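The node-to-class constraint can be sketched as follows (again a hedged illustration rather than the released code); for simplicity the extra softmax of (9) is folded into the temperature-scaled softmax of (8), so the loss reduces to a cross-entropy between the two relational distributions over the unlabeled nodes.

```python
import torch
import torch.nn.functional as F

def node_to_class_logits(h, protos, tau=0.2):
    """Temperature-scaled cosine similarities between node embeddings and class prototypes (Eq. (8))."""
    h_n = F.normalize(h, dim=1)
    p_n = F.normalize(protos, dim=1)
    return (h_n @ p_n.t()) / tau          # (m, c)

def node_to_class_loss(h_local, h_global, protos_local, protos_global,
                       unlabeled_mask, tau=0.2):
    """Middle-grained consistency of Eq. (9), dropping the entropy term g(p^local)."""
    p_local = F.softmax(node_to_class_logits(h_local[unlabeled_mask], protos_local, tau), dim=1)
    log_p_global = F.log_softmax(node_to_class_logits(h_global[unlabeled_mask], protos_global, tau), dim=1)
    return -(p_local * log_p_global).sum(dim=1).sum()   # cross-entropy over unlabeled nodes
```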

3.4 Feature fusion and node classification

Node representations containing the individuality and commonality of nodes are generated by the designed encoders under the defined multi-granularity consistency constraints. These two kinds of important information are integrated so that they complement each other, yielding the final node representations as follows:

$$\begin{aligned} \textbf{H} = \lambda {\mathbf{{H}}^{local}} + (1 - \lambda ){\textbf{H}^{global}}, \end{aligned}$$
(10)

where \(\lambda \) is a trade-off hyper-parameter between the individuality and commonality of nodes, and \(\lambda \in (0,1)\).

Then, the embedding vectors of the l labeled nodes, denoted \(\textbf{H}^{'} \in {\mathbb R}^{l \times c}\), are taken from \(\textbf{H}\), and the cross-entropy classification loss penalizes the differences between the predicted labels \(\hat{\textbf{Y}} = softmax (\textbf{H}^{'})\) and the ground truth \(\textbf{Y}\in {\mathbb R}^{l\times c}\) of the labeled nodes as follows:

$$\begin{aligned} {L_{cross}} = - \sum \limits _{i = 1}^l {\sum \limits _{j = 1}^c {{{\mathbf{{Y}}}_{ij}}\ln {\hat{\textbf{Y}}_{ij}}}}. \end{aligned}$$
(11)

Finally, the proposed semi-supervised classification model IMCN is trained with the overall loss function expressed as follows:

$$\begin{aligned} L = {L_{cross}} + \alpha {L_{node}} + \beta {L_{class}} + \gamma {L_{node2class}}, \end{aligned}$$
(12)

where \(\alpha \), \(\beta \), and \(\gamma \) are three adjustable hyper-parameters that weight the importance of the multi-granularity consistencies. The training process of IMCN is sketched in Algorithm 1.

Algorithm 1 Individuality-enhanced and Multi-granularity Consistency-preserving graph neural Network (IMCN)
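Putting the pieces together, the sketch below reuses the helpers from the earlier snippets to illustrate one training step under the overall loss of (12). It is a simplified stand-in for Algorithm 1; the hyper-parameter values and the `state` dictionary holding the running prototypes (initialized, e.g., to zeros) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def imcn_training_step(x, adj, labels, labeled_mask, local_enc, global_enc, state,
                       lam=0.3, mu=0.3, tau=0.2, alpha=1.0, beta=1.0, gamma=1.0):
    """One forward pass of IMCN returning the overall loss of Eq. (12)."""
    n_classes = int(labels.max()) + 1

    # Stage 1: feature extraction (Sect. 3.2).
    h_local = column_l2_normalize(local_enc(x))
    h_global = column_l2_normalize(global_enc(adj, x))

    # Stage 2: multi-granularity consistency constraints (Sect. 3.3).
    l_node = node_level_loss(h_local, h_global, labels, labeled_mask)

    protos_l = class_prototypes(h_local, labels, labeled_mask, n_classes)
    protos_g = class_prototypes(h_global, labels, labeled_mask, n_classes)
    # Momentum update of Eq. (7); the previous prototypes are detached so gradients
    # only flow through the current iteration's contribution.
    state["protos_l"] = update_prototypes(state["protos_l"].detach(), protos_l, mu)
    state["protos_g"] = update_prototypes(state["protos_g"].detach(), protos_g, mu)
    l_class = class_level_loss(state["protos_l"], state["protos_g"])

    l_n2c = node_to_class_loss(h_local, h_global, state["protos_l"], state["protos_g"],
                               ~labeled_mask, tau)

    # Stage 3: feature fusion and node classification (Eqs. (10)-(11)).
    h = lam * h_local + (1.0 - lam) * h_global
    l_cross = F.cross_entropy(h[labeled_mask], labels[labeled_mask])

    return l_cross + alpha * l_node + beta * l_class + gamma * l_n2c   # Eq. (12)
```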

4 Experimental setup

This section presents the experimental setup from the following three perspectives: (1) benchmark datasets used for training and testing the model; (2) baseline models compared with the proposed model; and (3) parameter settings for the proposed model in the series of experiments.

4.1 Datasets

Six benchmark datasets are used in the experiments for a comprehensive comparison between the proposed method and state-of-the-art methods, including three undirected citation networks from [40], two co-purchasing networks segmented from the Amazon co-purchasing graph [41], and one co-authorship network [42] from the KDD Cup 2016 challenge. Detailed statistics of these datasets are summarized in Table 2, where the density of each dataset is defined as the ratio between the number of actual edges in the dataset and the number of edges in its corresponding fully connected graph.

Table 2 Dataset statistics

Following the data preprocessing in [19], each dataset is split into training, validation, and test sets as follows: (1) For the three citation networks, twenty labeled nodes per class are used as the training set, with 500 nodes and 1,000 nodes as the validation and test sets, respectively. (2) For the other three networks, thirty labeled nodes per class are used as the training set, thirty nodes per class as the validation set, and the remaining nodes as the test set.
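For concreteness, a hypothetical helper implementing the per-class split described above might look as follows; the function name and seeding scheme are illustrative, and the actual preprocessing follows [19].

```python
import torch

def per_class_split(labels, n_train_per_class, n_val=None, n_test=None,
                    n_val_per_class=None, seed=0):
    """Per-class data split of Sect. 4.1 (hypothetical helper).

    Citation networks:      n_train_per_class=20, n_val=500, n_test=1000.
    Amazon / co-authorship: n_train_per_class=30, n_val_per_class=30, rest as test.
    """
    g = torch.Generator().manual_seed(seed)
    train, val = [], []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        idx = idx[torch.randperm(idx.numel(), generator=g)]
        train.append(idx[:n_train_per_class])
        if n_val_per_class is not None:
            val.append(idx[n_train_per_class:n_train_per_class + n_val_per_class])
    train = torch.cat(train)
    taken = set(train.tolist())
    if n_val_per_class is not None:          # fixed number per class for validation, rest as test
        val = torch.cat(val)
        taken |= set(val.tolist())
        test = torch.tensor([i for i in range(labels.numel()) if i not in taken])
        return train, val, test
    rest = torch.tensor([i for i in range(labels.numel()) if i not in taken])
    rest = rest[torch.randperm(rest.numel(), generator=g)]
    return train, rest[:n_val], rest[n_val:n_val + n_test]
```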

4.2 Comparison models

To verify the effectiveness of the proposed model, a comparison is made between the proposed IMCN model and ten other baseline methods, including four basic deep graph models (GCN [11], GAT [24], SGC [25], and H-GCN [12]), and six GCN-based contrastive models (DGI [26], GMI [43], MVGRL [21], GRACE [27], CG\(^3\) [19], and VCHN [20]). The description of the details of these methods is as follows.

1) GCN [11] produces node embedding vectors by a recursive average neighborhood aggregation scheme. It is derived from the related work of conducting graph convolutions in the spectral domain [44].

2) GAT [24] generates node embedding vectors by modeling the differences between the node and its one-hop neighbors.

3) SGC [25] reduces the excess complexity in GCN by removing nonlinearities and collapsing weight matrices between consecutive layers.

4) DGI [26] generates node embeddings and graph summary vector for the original input graph and constructs a corrupted graph to obtain negative node embeddings with the same GNN encoder. Then DGI aims at maximizing the mutual information between positive node embeddings and the graph summary vector and minimizing it between negative node embeddings and the graph summary vector.

5) GMI [43], different from DGI, focuses on maximizing the mutual information of feature and edge between the input graph and the output graph of the encoder.

6) MVGRL [21] uses graph diffusion to generate an additional structural view of a graph; the original-view and diffusion-view graphs are then fed to GNNs and a shared MLP to learn node representations. The learned features are then fed to a graph pooling layer and a shared MLP to learn graph representations. A discriminator contrasts node representations from one view with the graph representation of the other view, and vice versa, and scores the agreement between representations, which is used as the training signal.

7) GRACE [27] jointly corrupts the input graph at both topology and node attribute levels, such as removing edges and masking node features, to provide diverse contexts for nodes in different views. Then contrastive learning is conducted between node embeddings from two views.

8) H-GCN [12] is an improved GCN-based model that expands the receptive field of graph convolutions in GCN.

9) CG\(^3\) [19] employs the H-GCN model and a two-layer GCN module to obtain local and global node embeddings and designs a semi-supervised node-level contrastive loss and a graph-level generative loss to optimize the model learning process.

10) VCHN [20] uses a two-layer GCN module and a two-layer GAT module to obtain latent features from spectral and spatial views and designs a strategy to generate confident pseudo-labels for unsupervised nodes.

Note that, in the last experiment of Section 5, IMCN with a two-layer GCN as the global encoder is added to illustrate the effectiveness of the scheme proposed in this paper.

Fig. 3 Classification accuracies of the proposed IMCN model with different \(\alpha \), \(\beta \), and \(\gamma \)

4.3 Experimental settings

The proposed IMCN model was trained using the Adam optimizer for 500 epochs with the following settings: (1) The ReLU function is adopted as the non-linear activation of the hidden layers. (2) The output dimension of the local and global node representations is fixed to the number of classes. The dimensions of the hidden layers, the learning rate, the weight decay, and the dropout ratio are searched in \(\{32, 64, 128\}\), \(\{0.1, 0.05, 0.01\}\), \(\{0.01, 0.005, 0.001, 0.0005\}\), and \(\{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}\), respectively. (3) The hyper-parameters \(\mu \), \(\lambda \), and \(\tau \) in the proposed IMCN model are searched in \(\{0.1, 0.2, 0.3, 0.4, 0.5\}\). (4) The hyper-parameters \(\alpha \), \(\beta \), and \(\gamma \) for the trade-off among the three consistencies at different granularities are tuned in \(\{0.1, 0.5, 1, 1.5, 2\}\). In each experiment, the proposed IMCN model is run for 10 random trials, and the mean and standard deviation of the best test classification accuracy are reported. The results of the comparison methods are excerpted directly from the original papers; where unavailable, corresponding experiments are conducted to obtain the results. The code and datasets are publicly available at https://github.com/xinya0817/IMCN.

5 Experimental results and analysis

This section presents the experimental results and discusses the performance of IMCN from the following seven aspects: (1) performance of IMCN with different weights of the node-level, class-level, and node-to-class-level consistency constraints; (2) performance of IMCN with different updating rates for learning class prototypes, different temperatures for calculating the node-to-class distribution, and different weights of the local embedding in the final embedding; (3) visualization of the original nodes and the node embeddings learned by IMCN and its component modules; (4) ablation study of IMCN with different loss terms; (5) performance of IMCN in alleviating over-smoothing; (6) performance of IMCN and the comparison methods with scarce labeled training data; and (7) performance of IMCN and the baselines on common benchmark datasets. In all tables of experimental results, the highest record on each dataset is highlighted in bold.

5.1 Performance of IMCN with different weights on the consistency constraints

Experiments are carried out to determine the effectiveness of the three local-global consistency constraints in IMCN. To verify the performance of the proposed IMCN model with very limited labeled training nodes, two citation networks (a small one and a relatively large one) are used with label rates of 0.5% for Cora and 0.03% for PubMed. When different values are set for one hyper-parameter, the other two hyper-parameters are fixed.

The classification accuracies of IMCN with different values for \(\alpha \), \(\beta \), and \(\gamma \) are shown in Fig. 3, and the following observations can be obtained:

(1) IMCN obtained the best classification result on Cora when \(\alpha =1.5\), \(\beta =1\), and \(\gamma =2\). This is significantly superior to the results when \(\alpha \), \(\beta \), or \(\gamma \) equals 0.1.

(2) IMCN achieved the best result on PubMed when \(\alpha =0.5\), \(\beta =1\), and \(\gamma =0.1\), which is clearly better than when these hyper-parameters are set to other values. The impacts of the three consistency levels at different granularities on the model are thus quantified.

(3) The IMCN model demonstrates satisfactory performance when the values of \(\alpha \) and \(\beta \) are approximately 1, regardless of whether it is applied to the Cora or PubMed dataset. However, the same stability was not observed for the parameter \(\gamma \), which indicates that IMCN is considerably more responsive to the weight of the node-to-class loss. As a result, it is recommended to set the parameters \(\alpha \) and \(\beta \) to 1, while the parameter \(\gamma \) may require careful fine-tuning in domain-specific applications.

Fig. 4 Classification accuracies of the proposed IMCN model with different \(\mu \), \(\tau \), and \(\lambda \)

5.2 Performance of IMCN with different parameters \(\mu \), \(\tau \), \(\lambda \)

This part discusses the impacts of different rates \(\mu \) for updating class prototypes, different temperatures \(\tau \) for the node-to-class distribution, and different weights \(\lambda \) of the local embedding in the final embedding on the performance of the proposed IMCN model. Classification experiments are conducted on Cora and PubMed with 0.5% and 0.03% labeled nodes, respectively. From the results shown in Fig. 4, the following observations can be obtained:

(1) IMCN performs best when \(\mu \) is set to 0.3 on Cora and 0.4 on PubMed. Smaller or larger values of \(\mu \) cannot ensure that IMCN achieves ideal performance, which implies that an appropriate updating speed for the class prototypes is important for learning stable and expressive node representations. On the one hand, when \(\mu \) is very small, node embeddings cannot incorporate new useful information in a timely manner. On the other hand, when \(\mu \) is very large, important information learned previously cannot be retained.

Fig. 5 Two-dimensional visualization of the original nodes and the node embeddings obtained by MLP, H-GCN, and the proposed IMCN model on Cora

(2) In general, a low value of the temperature hyper-parameter \(\tau \) ensures that IMCN achieves the best performance, as seen on Cora. This is because a small temperature hyper-parameter ensures a high concentration of the node-to-class distribution. However, the better result obtained on PubMed when \(\tau =0.3\) may be attributed to PubMed's relatively large scale and sparse structure.

(3) The best classification accuracy was obtained when \(\lambda =0.3\) for IMCN on both Cora and PubMed. A much smaller or larger proportion of local information in the final node representations cannot produce ideal results. This is because the hierarchical GCN module takes the feature information of the node itself into account during learning, but some of it is damaged by the propagation and aggregation operations of the GCN layers.

(4) IMCN demonstrates strong performance when the values of \(\mu \) and \(\lambda \) are set to 0.4 and 0.3, respectively, whether applied to the Cora or PubMed dataset. However, the same level of consistency was not observed for the parameter \(\tau \), which indicates that IMCN is highly sensitive to the temperature parameter in the node-to-class loss. Therefore, it is recommended to set the parameters \(\mu \) and \(\lambda \) to 0.4 and 0.3, respectively, while the parameter \(\tau \) may require fine-tuning in practical tasks.

5.3 Visualization of node embeddings learned by different models

The t-SNE algorithm [45] is used to visualize the original nodes of Cora [40] with a label rate of 0.5% and their embedding representations learned by a two-layer MLP (only the features of the node itself are used), the representative model H-GCN [12] (feature information propagated from multi-hop neighbors is used), and the proposed IMCN model (which integrates the feature information of the node itself and that from its neighbors). All original and embedded nodes are projected into a two-dimensional space for visualization, as shown in Fig. 5.

From the results, the following observations can be obtained. After the embedding process of a simple two-layer MLP model, nodes from different classes are still mixed and cannot be clearly distinguished. H-GCN can group most embedded nodes into their classes correctly; however, many nodes from different classes in the central area of Fig. 5(c) are very close, which can easily lead to misclassification. Compared with the above two methods, the proposed IMCN model can push nodes from different classes apart while increasing the distance between the classes, ensuring low classification errors.

5.4 Ablation study of IMCN

Ablation experiments are carried out on three citation networks to demonstrate the effectiveness of the various local-global consistency constraints in IMCN. The label rates of Cora, CiteSeer, and PubMed are 0.5%, 0.5%, and 0.03%, respectively. The experimental results are listed in Table 3.

Table 3 Ablation study of the proposed IMCN method with different loss terms (%)

From the results, it can be seen that the designed multi-granularity constraints (feature embedding agreement, semantic class alignment, and the identity of the node-to-class relational distributions) bring a significant improvement. The consistency of the local and global perspectives at multiple levels reveals the complementarity between the individuality and commonality of nodes. By combining these constraints, IMCN can make full use of both the limited labeled nodes and the abundant unlabeled nodes and integrate useful information from the two views.

5.5 Performance of IMCN on alleviating over-smoothing

Through a series of experiments, it was observed that a nine-layer GCN on Cora and a twelve-layer GCN on CiteSeer cause all node representations to become similar and indistinguishable, as shown in Fig. 6(a) and (c). The proposed IMCN models with the corresponding GCNs as global encoders then produce low-dimensional node representations on these two datasets, as shown in Fig. 6(b) and (d). From these figures, it can be seen that the proposed IMCN method clearly alleviates the over-smoothing problem caused by multiple convolution operations.

Fig. 6 Two-dimensional visualization of the node embeddings obtained by multi-layer GCNs and the proposed IMCN models with the corresponding multi-layer GCNs on Cora and CiteSeer

Table 4 Classification accuracies of the proposed IMCN method and comparison methods on Cora, CiteSeer, and PubMed with very limited labeled nodes (%)

5.6 Performance of IMCN and comparison methods on datasets with scarce labeled nodes

In this part, experiments are conducted to verify the effectiveness of the proposed IMCN method in learning expressive node embeddings when only a few nodes are labeled during training. In the experiments, three benchmark graph datasets (Cora, CiteSeer, and PubMed) with different label rates are used: 0.5%, 1%, and 3% labeled nodes for Cora and CiteSeer; and 0.03%, 0.05%, and 0.1% labeled nodes for PubMed, respectively. The classification accuracies of all methods are listed in Table 4, from which the following three observations are obtained.

(1) The proposed IMCN method outperforms most baselines at the different label rates on the three datasets, especially when very few nodes are labeled. For example, on CiteSeer with 0.5% labeled nodes, the classification accuracy of IMCN is significantly higher than that of the other methods, being 6.5% higher than the method ranked second. This is mainly because IMCN can capture the abundant individuality and commonality of nodes while considering the complex relations among nodes and classes.

(2) On Cora with a label rate of 1% and PubMed with a label rate of 0.1%, IMCN does not surpass VCHN, which ranks first, but it is clearly better than the method ranked third. Concretely, the classification accuracy of the proposed IMCN method is 2.6% and 0.1% lower than that of VCHN but 3.7% and 0.4% higher than that of H-GCN on these two datasets, respectively.

(3) The performance of the proposed IMCN model is quite good when the graph density is high, especially on Cora and CiteSeer. This is in obvious contrast with its performance on PubMed, which is much sparser than the first two datasets according to the density information in Table 2.

5.7 Performance of IMCN and baselines on common benchmark datasets

In this section, classification experiments are conducted on the six networks with their standard data splits to assess the performance of the proposed IMCN method and compare it with the baseline methods. Note that IMCN1 and IMCN2 are the methods proposed in this paper with the H-GCN model and a two-layer GCN as the global encoder, respectively. In addition, two comparison methods corresponding to IMCN1 and IMCN2 without the multi-granularity consistency constraints are designed: IEN1 takes the H-GCN model and a two-layer MLP as encoders, and IEN2 uses a two-layer GCN and a two-layer MLP as encoders.

Table 5 Classification accuracies of the proposed IMCN method and comparison methods on six benchmark datasets (%)
Fig. 7 Average ranks of all methods with the critical distance (CD) for classification accuracy according to the Nemenyi test [46]

Table 6 Macro-F\(_{1}\) results of the proposed IMCN methods and comparison methods on six benchmark datasets (%)

First, the classification accuracies of all methods are shown in Table 5, where the top-two results on each dataset are marked in bold. The following three conclusions can be drawn:

(1) The performance of IMCN1 and IMCN2 is clearly better than that of the first three traditional models, GCN, GAT, and SGC, which are based on a single view. This is because IMCN combines two different views of the graph to capture both shared and complementary information.

(2) The proposed methods IMCN1 and IMCN2 outperform the contrastive learning-based comparison methods on all experimental datasets. In particular, IMCN1 and IMCN2 obtain the best classification accuracies on Photo, which are about 2.9% and 3.3% higher than that of the recently proposed CG\(^3\), which ranks third. This is mainly owing to the individuality of nodes enhanced by the designed simple local encoder and the multi-granularity relations among nodes and classes maintained by the designed consistency constraints.

(3) IMCN1 and IMCN2 are clearly better than the corresponding IEN1 and IEN2 on most datasets. For example, on the citation network CiteSeer, the node classification accuracies are improved by 3.7% and 5.7% with the designed multi-granularity consistency constraints of the proposed method.

Table 7 FLOPs, trainable parameters, test time, and running memory of GCN, H-GCN, and the corresponding IMCNs

Then, the widely used Nemenyi statistical test [46] is employed to conduct a comprehensive analysis of the significant differences among the proposed IMCN methods and the 12 comparison methods on the six datasets, using the classification accuracies in Table 5. The average ranks of all methods with the critical distance (CD) are plotted in Fig. 7. The following observations are obtained. The classification accuracies of the proposed IMCN1, IMCN2, IEN1, IEN2, CG\(^3\), H-GCN, and MVGRL are statistically better than those of the other seven comparison methods. There is no consistent evidence to indicate statistical differences among IMCN1, IMCN2, IEN1, IEN2, CG\(^3\), H-GCN, and MVGRL in terms of classification accuracy.

According to the ranks of the proposed methods and the 12 comparison methods, the macro-F\(_1\) results of the first six methods (IMCN1, IMCN2, IEN1, IEN2, CG\(^3\), and H-GCN), the latest method VCHN, and the baseline GCN are listed and compared in Table 6, with the best results marked in bold. It can be seen that, under the macro-F\(_1\) metric, the performance of the IMCNs is much better than that of their corresponding IENs, GCN, H-GCN, and VCHN in most cases, especially on Cora. This is mainly attributed to the specifically designed individuality-enhanced module and the three consistency constraints at different levels.

Finally, Table 7 shows the FLOPs, trainable parameters, test time, and running memory of the proposed methods IMCN1 and IMCN2 and the corresponding baseline methods (H-GCN and GCN). It can be seen that the space and time complexity of the proposed methods is slightly higher than that of the corresponding baseline methods while maintaining competitive performance. This is because of the two-layer MLP added to enhance individuality and the three consistency constraints between the local and global encoders.

6 Conclusions and future work

In this paper, we proposed a graph neural network called the Individuality-enhanced and Multi-granularity Consistency-preserving Network (IMCN). IMCN aims to take advantage of the limited yet valuable supervised information available in labeled data and to effectively enhance the classification capability. On the one hand, a simple MLP module was combined with the original GCN-based model to enhance individual information when learning node representations. On the other hand, the complex relations among nodes and classes were fully exploited by the three designed consistency constraints to optimize the encoding processes of the two encoders. Extensive experiments were conducted on various public benchmark datasets, and the results demonstrated the effectiveness of the proposed IMCN method in solving node classification tasks with extremely limited labeled nodes.

The proposed IMCN model has strict requirements on the input graph: it assumes that the entire graph structure is available to capture the common-feature information of nodes. Moreover, IMCN has a considerable number of hyper-parameters that require tuning and can be inefficient when dealing with very large networks. In the future, our focus will be on developing scalable and efficient deep semi-supervised node classification methods specifically designed for large-scale graph datasets. We also aim to explore automatic parameter tuning techniques using optimization methods, such as [47].