Hierarchical message-passing graph neural networks

Graph Neural Networks (GNNs) have become a prominent approach to machine learning with graphs and have been increasingly applied in a multitude of domains. Nevertheless, since most existing GNN models are based on flat message-passing mechanisms, two limitations need to be tackled: (i) they are costly in encoding long-range information spanning the graph structure; (ii) they fail to encode features in the high-order neighbourhood of the graph, as they only aggregate information across the observed edges of the original graph. To deal with these two issues, we propose a novel Hierarchical Message-passing Graph Neural Networks framework. The key idea is to generate a hierarchical structure that re-organises all nodes in a flat graph into multi-level super graphs, along with innovative intra- and inter-level propagation schemes. The derived hierarchy creates shortcuts connecting far-away nodes, so that informative long-range interactions can be efficiently accessed via message passing, and it incorporates meso- and macro-level semantics into the learned node representations. We present the first model to implement this framework, termed Hierarchical Community-aware Graph Neural Network (HC-GNN), with the assistance of a hierarchical community detection algorithm. The theoretical analysis illustrates HC-GNN's capacity to capture long-range information without introducing heavy additional computational complexity. Empirical experiments conducted on 9 datasets under transductive, inductive, and few-shot settings show that HC-GNN can outperform state-of-the-art GNN models in network analysis tasks, including node classification, link prediction, and community detection. Moreover, the model analysis further demonstrates HC-GNN's robustness to graph sparsity and its flexibility in incorporating different GNN encoders.


Introduction
Graphs are a ubiquitous data structure that models objects and their relationships within complex systems, such as social networks, biological networks, and recommendation systems [1]. Learning node representations from a large graph has proved useful for a wide variety of network analysis tasks, including link prediction [2], node classification [3] and community detection [4].
Graph Neural Networks (GNNs) are currently one of the most promising paradigms to learn and exploit node representations due to their ability to encode node features and graph topology in transductive, inductive, and few-shot settings [5]. Many existing GNN models follow a similar flat message-passing principle, where information is iteratively passed between adjacent nodes along observed edges. Such a paradigm is able to incorporate the local information surrounding each node [6]. However, it has been shown to suffer from several drawbacks [7-9].
Among these deficiencies of flat message-passing GNNs, the limited ability to aggregate information over long ranges has attracted significant attention [10], since most graph-related tasks require interactions between nodes that are not directly connected [11]. That is, flat message-passing GNNs struggle to capture dependencies between distant node pairs. Inspired by the outstanding effectiveness of very deep neural network models demonstrated in computer vision and natural language processing [12], a natural solution is to stack many GNN layers to directly increase the receptive field of each node. Consequently, deeper models have been proposed that simplify the aggregation design of GNNs and add well-designed normalisation units or specific gradient descent methods [13, 14]. Nevertheless, Alon and Yahav have theoretically shown that flat GNNs are susceptible to a bottleneck when aggregating messages across a long path, which leads to severe over-squashing issues [11].
On the other hand, in this paper, we further argue that another crucial deficiency of flat message-passing GNNs is that they rely only on aggregating messages across the observed topological structure. The hierarchical semantics behind the graph structure provide useful information and should be incorporated into the learning of node representations. Take the collaboration network in Fig. 1-(a) as an example: author nodes highlighted in light yellow come from the same institutes, and nodes filled with different colours indicate authors in various research areas. In order to generate the node representation of a given author, existing GNNs mainly capture the co-author level information depending on the explicit graph structure. However, information hidden at the meso and macro levels is neglected. In the example of Fig. 1, meso-level information means that authors belong to the same institutes, together with their connections to adjacent institutes. Macro-level information refers to authors of the same research areas and their relationship with related research areas. Neither meso- nor macro-level knowledge can be directly modelled through flat message passing via observed edges.
In this paper, we investigate the idea of a hierarchical message-passing mechanism to enhance the information aggregation pipeline of GNNs. The ultimate goal is to make the node representation learning process aware of both long-range interactive information and implicit multi-resolution semantics within the graph.
We note that a few graph pooling approaches have recently attempted to use the hierarchical structure idea [15-19]. g-U-Net [15] and GXN [19] employ bottom-up and top-down pooling operations; however, they do not allow long-range message-passing. DiffPool [16], AttPool [17] and ASAP [18] target graph classification tasks instead of enabling node representations to capture long-range dependencies and multi-grained semantics of one graph. Moreover, P-GNNs [20] create a different information aggregation mechanism that utilises sampled anchor nodes to impose topological position information on the learned node representations. While P-GNNs can capture global information, the hierarchical semantics mentioned above are still overlooked, and global message-passing is not realised. Besides, the anchor-set sampling process is time-consuming for large graphs, and it cannot work well under the inductive setting.
Specifically, we present a novel framework, Hierarchical Message-passing Graph Neural Networks (HMGNNs), elaborated in Fig. 1. In detail, HMGNNs can be organised into the following four phases.
(i) Hierarchical structure generation. To overcome long-distance obstacles in the process of GNN message-passing, we propose to use a hierarchical structure to reduce the size of graph $G$ gradually, where nodes at each level $t$ are integrated into different super nodes $(s^{t+1}_1, \dots, s^{t+1}_n)$ at level $t+1$.
(ii) $t$-level super graph construction. In order to allow message passing among the generated same-level super nodes, we construct a super graph $G_t$ based on the connections between nodes at its lower level $t-1$.
(iii) Hierarchical message propagation. With the generated hierarchical structure for a given graph, we develop three propagation schemes: bottom-up, within-level and top-down.
(iv) Model learning. Last, we leverage task-specific loss functions and a gradient descent procedure to train the model.
Designing a feasible hierarchical structure is crucial for HMGNNs, as the hierarchical structure determines how messages can be passed through different levels and what kind of meso- and macro-level information is encoded in node representations. In this paper, we consider (but are not restricted to) network communities. As a natural graph property, the community has proved very useful for many graph mining tasks [21, 22]. Many community detection methods can generate hierarchical community structures. Here, we propose an implementation model for the proposed framework, Hierarchical Community-aware Graph Neural Network (HC-GNN). HC-GNN exploits a well-known hierarchical community detection method, i.e., the Louvain method [23], to build up the hierarchical structure, which is then used for the hierarchical message-passing mechanism.
The theoretical analysis illustrates HC-GNN's capacity to capture long-range information without introducing heavy additional computational complexity. Extensive empirical experiments are conducted on 9 graph datasets to reveal the performance of HC-GNN on a variety of tasks, i.e., link prediction, node classification, and community detection, under transductive, inductive and few-shot settings. The results show that HC-GNN consistently outperforms a set of state-of-the-art approaches for link prediction and node classification. In the few-shot learning setting, where only 5 samples of each label are used to train the model, HC-GNN achieves a significant performance improvement of up to 16.4%. We also deliver a few empirical insights: (a) the lowest level contributes most to node representations; (b) how the hierarchical structure is generated has a significant impact on the quality of node representations; (c) HC-GNN maintains an outstanding performance for graphs with different levels of sparsity perturbation; (d) HC-GNN possesses significant flexibility in incorporating different GNN encoders, which means HC-GNN can achieve superior performance with advanced flat GNN encoders.
Contributions. The contribution of this paper is five-fold:
1. We propose a novel Hierarchical Message-passing Graph Neural Networks framework, which allows nodes to conveniently capture informative long-range interactions and encode multi-grained semantics hidden behind the given graph.
2. We present the first implementation of our framework, namely HC-GNN, by detecting and utilising hierarchical community structures for message passing.
3. Theoretical analysis demonstrates the efficiency and the capacity of HC-GNN in capturing long-range interactions in graphs.
4. Experimental results show that HC-GNN significantly outperforms competing GNN methods on several prediction tasks under transductive, inductive, and few-shot settings.
5. Further empirical analysis is conducted to derive insights into the impact of the hierarchical structure and graph sparsity on HC-GNN and to confirm its flexibility in incorporating different GNN encoders.
The rest of this paper is organised as follows. We begin by briefly reviewing additional related work in Sec. 2. Then in Sec. 3, we introduce the preliminaries of this study and state the research problem. In Sec. 4, we introduce our proposed framework Hierarchical Message-passing Graph Neural Networks and its first implementation, HC-GNN. Experimental results and empirical analysis are shown in Sec. 5. Finally, we conclude the paper and discuss future work in Sec. 6.

Related Work
Flat message-passing GNNs. They perform graph convolution, directly aggregate node features from neighbours in the given graph, and stack multiple GNN layers to capture long-range node dependencies [7, 24-26]. However, they were observed not to benefit from more than a few layers, and recent studies have theoretically expressed this problem as over-smoothing [10, 11], i.e., node representations become indistinguishable when the number of GNN layers increases. On the other hand, GraphRNA [27] presents graph recurrent networks to capture interactions between far-away nodes. Still, it cannot be applied to inductive learning settings because it relies on attributed random walks, and its recurrent aggregations introduce high computation costs. P-GNNs [20] incorporate a novel global information aggregation mechanism based on the distance of a given target node to each anchor set. However, P-GNNs sacrifice the ability of existing GNNs on inductive node-wise tasks. As shown in their paper, they only support pairwise node classification tasks, i.e., comparing whether two nodes have the same class label instead of predicting the class label of each individual node. Additionally, the anchor-set sampling operation brings a high computational cost for large graphs. Recently, deeper flat GNNs have been proposed that simplify the aggregation design of GNNs and add well-designed normalisation units [13] or specific gradient descent methods [14]. Nevertheless, [11] has theoretically shown that flat GNNs are susceptible to a bottleneck when aggregating messages across a long path, which leads to severe over-squashing issues. We theoretically discuss the advantages of our method compared with flat GNNs in Sec. 4.3, in terms of long-range interactive capability and complexity.
Hierarchical representation GNNs. In recent years, some studies have generalised the pooling mechanism of computer vision [28] to GNNs for graph representation learning [15-19, 29, 30]. However, most of them, such as DiffPool [16], AttPool [17] and ASAP [18], are designed for graph classification tasks rather than for learning node representations that capture long-range dependencies and multi-resolution semantics. Thus they cannot be directly applied to node-level tasks. g-U-Net [15] defines a similarity-based pooling operator to construct the hierarchical structure, and GXN [19] designs another infomax pooling operator; both implement bottom-up and top-down operations. Despite the success of g-U-Net and GXN in producing graph-level representations, they cannot model multi-grained semantics or realise long-range message-passing. HARP [31] and LouvainNE [32] are two unsupervised network representation approaches that adopt a hierarchical structure, but they do not support the supervised training paradigm to optimise for specific tasks, and they cannot be applied in inductive settings.
More recently, HGNet [30] leverages multi-resolution representations of a graph to facilitate capturing long-range interactions. Below, we discuss the main differences between HGNet and HC-GNN. HC-GNN designs efficient and effective bottom-up and top-down propagation mechanisms to realise hierarchical message-passing, rather than directly applying pooling and relational GCN, respectively. We further provide a theoretical analysis to demonstrate the efficiency and capacity of HC-GNN; such an analysis has not been performed on HGNet. We also provide a much more careful and comprehensive set of experimental studies to validate the effectiveness of HC-GNN, including comparing learning settings on node classification (transductive, inductive, and few-shot), comparing to more recent competing flat GNN methods, comparing to state-of-the-art hierarchical GNN models, evaluating on the link prediction task, and an in-depth analysis of graph sparsity and primary GNN encoders (Sec. 5). Last but not least, in addition to capturing long-range interactions, we further discuss in depth the benefits and usefulness of the hidden hierarchical structure in a graph.
Table 8 summarises the critical advantages of the proposed HC-GNN and compares it with a number of state-of-the-art methods published recently. We are the first to present hierarchical message passing to efficiently model long-range informative interactions and multi-grained semantics. In addition, our HC-GNN can utilise the community structures and be applied for transductive, inductive and few-shot inference.

Problem Statement
An attributed graph with $n$ nodes can be represented as $G = (V, E, X)$, where $V = \{v_1, v_2, \dots, v_n\}$ is the node set, $E \subseteq V \times V$ denotes the set of edges, and $X = \{x_1, x_2, \dots, x_n\} \in \mathbb{R}^{n \times \pi}$ is the feature matrix, in which each vector $x_i \in X$ is the feature vector associated with node $v_i$, and $\pi$ is the dimension of the input feature vector of each node. For subsequent discussion, we summarise $V$ and $E$ into an adjacency matrix $A \in \{0, 1\}^{n \times n}$.
Problem definition. Given a graph $G$ and a pre-defined representation dimension $d$, the goal is to learn a mapping function $f: G \rightarrow Z$, where $Z \in \mathbb{R}^{n \times d}$ and each row $z_i \in Z$ corresponds to the representation of node $v_i$. The effectiveness of $f$ is evaluated by applying $Z$ to different tasks, including node classification, link prediction, and community detection. Table 1 lists the mathematical notation used in the paper.
Flat node representation learning. Prior to introducing the hierarchical message-passing mechanism, we first give a general review of existing Graph Neural Networks (GNNs) with flat message-passing. Let $\hat{A} = (\hat{A}_{uv})_{u,v \in V}$, where $\hat{A}_{uv}$ is a normalised value of $A_{uv}$. Thus, we can formally define the $\ell$-th layer of a flat GNN as:
$$m_a^{(\ell)} = \mathrm{Aggregate}^{N}\big(\{\hat{A}_{uv}, h_u^{(\ell-1)} \mid u \in N(v)\}\big),$$
$$m_v^{(\ell)} = \mathrm{Aggregate}^{I}\big(\{\hat{A}_{uv} \mid u \in N(v)\}\big)\, h_v^{(\ell-1)},$$
$$h_v^{(\ell)} = \mathrm{Combine}\big(m_a^{(\ell)}, m_v^{(\ell)}\big), \quad (1)$$
where $\mathrm{Aggregate}^{N}(\cdot)$ and $\mathrm{Aggregate}^{I}(\cdot)$ are two possibly differentiable parameterised functions. $m_a^{(\ell)}$ is the aggregated message from the neighbourhood nodes $N(v)$ of node $v$ with their structural coefficients, and $m_v^{(\ell)}$ is the residual message from node $v$ after performing an adjustment operation to account for structural effects from its neighbourhood nodes. Then, $h_v^{(\ell)}$ is the learned representation vector of node $v$, obtained by combining $m_a^{(\ell)}$ and $m_v^{(\ell)}$ with $\mathrm{Combine}(\cdot)$ at the $\ell$-th iteration/layer. Note that we initialise $h_v^{(0)} = x_v$, and the final learned representation vector after $L$ iterations/layers is $z_v = h_v^{(L)}$.
Take the classic Graph Convolutional Network (GCN) [24] as an example, which applies two normalised mean aggregations to aggregate the feature vectors of node $v$'s neighbourhood nodes $N(v)$ and combine them with the feature vector of $v$ itself:
$$h_v^{(\ell)} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} \frac{W^{(\ell)} h_u^{(\ell-1)}}{\sqrt{|N(u)||N(v)|}} + \frac{W^{(\ell)} h_v^{(\ell-1)}}{\sqrt{|N(v)||N(v)|}}\Big), \quad (2)$$
where $1/\sqrt{|N(u)||N(v)|}$ is a constant normalisation coefficient for the edge $E_{uv}$, which is calculated from the normalised adjacency matrix $D^{-1/2} A D^{-1/2}$. $D$ is the diagonal node degree matrix of $A$. $W^{(\ell)} \in \mathbb{R}^{n \times d}$ is a trainable weight matrix of layer $\ell$. From Eq. 1 and Eq. 2, we can see that existing GNNs iteratively pass messages between adjacent nodes along observed edges, which leads to two significant limitations: (a) a limited ability to aggregate information over long ranges, since they need to stack $k$ layers to capture interactions within $k$ steps of each node; (b) they cannot encode meso- and macro-level graph semantics.
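To make the flat message-passing step concrete, the following is a minimal sketch of one GCN-style layer in plain Python. It is an illustration rather than the authors' implementation: the function name `gcn_layer`, the dictionary-based graph format, and the choice to count a self-loop in each node's degree are all assumptions made for this example.

```python
import math


def gcn_layer(adj, h, weight):
    """One GCN-style layer: normalised sum over neighbours and self, then ReLU.

    adj:    dict node -> list of neighbour nodes (undirected, no self-loops)
    h:      dict node -> feature vector (list of floats) from the previous layer
    weight: d_in x d_out matrix given as a list of rows
    """
    d_out = len(weight[0])
    # Degree with a self-loop added (an assumption of this sketch).
    deg = {v: len(adj[v]) + 1 for v in adj}

    def transform(x):
        # Linear map x -> x W, written out without external libraries.
        return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(d_out)]

    out = {}
    for v in adj:
        acc = [0.0] * d_out
        for u in list(adj[v]) + [v]:  # neighbours plus the node itself
            coef = 1.0 / math.sqrt(deg[u] * deg[v])
            acc = [a + coef * m for a, m in zip(acc, transform(h[u]))]
        out[v] = [max(0.0, a) for a in acc]  # ReLU
    return out
```

Stacking k such layers lets a node see only its k-hop neighbourhood, which is the long-range limitation discussed above.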

Proposed Approach
We propose a framework, Hierarchical Message-passing Graph Neural Networks (HMGNNs), whose core idea is to use a hierarchical message-passing structure to enable node representations to receive long-range messages and multi-grained semantics from different levels. Fig. 2 provides an overview of the proposed framework, consisting of four components. First, we create a hierarchical structure to coarsen the input graph $G$ gradually. Nodes at each level $t$ of the hierarchy are grouped into different super nodes $(s^t_1, \dots, s^t_n)$. Second, we further organise the super nodes generated at level $t$ into a super graph $G_{t+1}$ at level $t+1$ based on the connections between nodes at level $t$, in order to enable message-passing that encodes the interactions between generated super nodes. Third, we develop three different propagation schemes to propagate messages among nodes within the same level and across different levels. Finally, after obtaining node representations, we use a task-specific loss function and a gradient descent procedure to train the model.

Hierarchical Message-passing GNNs
I. Hierarchical structure generation. Nodes $V$ of a graph $G$ can be naturally organised by super node structures of $T$ different levels, i.e., $\{V_1, V_2, \dots, V_T\}$, in which densely inter-connected nodes of $V_{t-1}$ are grouped into a super node of $V_t$. For example, in Fig. 1, author nodes are grouped into super nodes based on their institutes. Institutes can be further grouped into higher-level super nodes $V_3 = \{r_1, r_2, \dots, r_4\}$ according to research areas. Meanwhile, there is a relationship between nodes at different levels, as indicated by dashed lines in Fig. 1-(c). Hence, we can generate a hierarchical structure to depict the inter- and intra-relationships among authors, institutes, and research areas. We will discuss how to implement the hierarchical structure generation in Sec. 4.2.
II. $t$-level super graph construction. We construct a super graph $G_t$ for each level $t$, where $G_1$ represents the original graph $G$. Given nodes at level $t-1$, i.e., $V_{t-1} = \{s^{t-1}_1, \dots, s^{t-1}_m\}$, densely interconnected nodes of $V_{t-1}$ are grouped into a super node of $V_t$ according to Sec. 4.1-I. We further create an edge between two super nodes $s^t_i$ and $s^t_j$ if there exist more than $\gamma$ edges in $G_{t-1}$ connecting elements in $s^t_i$ and elements in $s^t_j$, where $\gamma$ is a hyper-parameter and $\gamma = 1$ by default. In this way, we can have an alternative representation of the hierarchical structure as a list of (super) graphs $H = \{G_1, \dots, G_T\}$, where $G_1 = G$. Moreover, inter-level edges are created to depict the relationships between (super) nodes at different levels $t$ and $t-1$, if a level $t-1$ node has a corresponding super node at level $t$; see for example Fig. 1-(c). We initialise the feature vectors of generated super nodes to be zero vectors with the same length as the original node feature vector $x_i$. Taking the collaboration network in Fig. 1 as an example, at the micro-level (level 1), we have authors and their co-authorship relations; at the meso-level (level 2), we organise authors according to their affiliations and establish relations between institutes; at the macro-level (level 3), institutes are further grouped according to their research areas, and we have the relations among the research areas. In addition, inter-level links are also created to depict the relationships between authors and institutes and between institutes and research areas.
III. Hierarchical message propagation. The hierarchical message-passing mechanism works as a supplementary process to enhance the node representations with long-range interactions and multi-grained semantics. Thus it does not change the flat node representation learning process described in Sec. 3, which ensures that the local information is well maintained. We adopt the classic GCN, as described in Eq. 2, as our default flat GNN encoder throughout the paper. In particular, the $\ell$-th layer of the hierarchical message-passing mechanism consists of 3 steps.
1. Bottom-up propagation. After obtaining node representations $h^{(\ell)}$ with the $\ell$-th flat information aggregation, we perform bottom-up propagation, i.e., NN-1 in Fig. 2-(b), using node representations in $G_{t-1}$ to update node representations in $G_t$ ($t \geq 2$) in the hierarchy $H$, according to Eq. 3, where $s^t_i$ is a super node in $G_t$ and $a^{(\ell)}_{s^t_i}$ is the updated representation of $s^t_i$.
2. Within-level propagation. We explore the typical flat GNN encoders [7, 13, 24-26] to propagate information within each level's graph $\{G_1, G_2, \dots, G_T\}$, i.e., NN-2 in Fig. 2-(c). The aim is to aggregate neighbours' information and update within-level node representations. Specifically, the information aggregation at level $t$ is depicted as follows:
$$m_a^{(\ell)} = \mathrm{Aggregate}^{N}\big(\{\hat{A}_{uv}, a_u^{(\ell)} \mid u \in N^t(v)\}\big),$$
$$m_v^{(\ell)} = \mathrm{Aggregate}^{I}\big(\{\hat{A}_{uv} \mid u \in N^t(v)\}\big)\, a_v^{(\ell)},$$
$$b_v^{(\ell)} = \mathrm{Combine}\big(m_a^{(\ell)}, m_v^{(\ell)}\big), \quad (4)$$
where $a_u^{(\ell)}$ is the node representation of $u$ after bottom-up propagation at the $\ell$-th layer, $N^t(v)$ is the set of nodes adjacent to $v$ at level $t$, and $b_v^{(\ell)}$ is the aggregated node representation of $v$ based on local neighbourhood information. Note that we adopt the classic GCN, as described in Eq. 2, as our default GNN encoder throughout the paper. We will discuss the possibility of incorporating other advanced GNN encoders in Sec. 5.3.
3. Top-down propagation. The top-down propagation is illustrated by NN-3 in Fig. 2. The contributions of messages from different levels can differ across tasks. Hence, we adopt the attention mechanism [26] to adaptively learn the contribution weights of different levels during top-down integration, given by:
$$h_v^{(\ell)} = \mathrm{ReLU}\big(\mathrm{MEAN}\{\alpha_{uv}\, b_u^{(\ell)} \mid u \in C(v) \cup \{v\}\}\big), \quad (5)$$
where $\alpha_{uv}$ is a trainable normalised attention coefficient between node $v$ and super node $u$ or itself, MEAN is an element-wise mean operation, $C(v)$ denotes the set of different-level super nodes from levels $\{2, \dots, K\}$ that node $v$ belongs to ($|C(v)| = K - 1$), and ReLU is the activation function.
$H^{(\ell)}$ is the generated node representation of layer $\ell$ with $h_v^{(\ell)} \in H^{(\ell)}$. We generate the output node representations of the last layer ($L$) via:
$$Z = \sigma\big(H^{(L)}\big), \quad (6)$$
where $\sigma$ is the Euclidean normalisation function to reshape values into $[0, 1]$. $Z \in \mathbb{R}^{n \times d}$ is the final generated node representation with each row vector $z_v \in Z$.
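The three propagation steps can be illustrated with a small plain-Python sketch. It is a simplified stand-in, not the authors' implementation: the bottom-up step is assumed to be a mean over member nodes together with the super node's previous state, the within-level step uses a plain neighbourhood mean in place of a GNN encoder, and the top-down step uses uniform weights unless attention coefficients are supplied. All function names and data layouts are hypothetical.

```python
def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def bottom_up(members, h_low, h_super_prev):
    """Update each super node from its member nodes and its own previous state."""
    return {s: mean_vec([h_low[v] for v in nodes] + [h_super_prev[s]])
            for s, nodes in members.items()}


def within_level(adj, a):
    """Mean over neighbours and self (a simplified stand-in for a GNN encoder)."""
    return {v: mean_vec([a[u] for u in adj[v]] + [a[v]]) for v in adj}


def top_down(ancestors, b_low, b_super, alpha=None):
    """Combine each node's own vector with those of its super nodes, then ReLU."""
    out = {}
    for v, supers in ancestors.items():
        sources = [b_low[v]] + [b_super[s] for s in supers]
        # Uniform weights unless attention coefficients are given per node.
        weights = alpha[v] if alpha else [1.0] * len(sources)
        weighted = [[w * x for x in vec] for w, vec in zip(weights, sources)]
        out[v] = [max(0.0, x) for x in mean_vec(weighted)]
    return out
```

For a two-level toy hierarchy, one would call `bottom_up`, then `within_level` on the super graph, then `top_down` for every original node; a node thereby receives information from any node sharing one of its super nodes in a single round.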
IV. Model learning. The proposed HMGNNs could be trained in unsupervised, semi-supervised, or supervised settings. Here, we only discuss the supervised setting used for node classification in our experiments. We define the loss function based on cross entropy, as follows:
$$\mathcal{L} = -\sum_{v \in V} y_v^{\top} \log\big(\mathrm{Softmax}(z_v)\big), \quad (7)$$
where $y_v$ is a one-hot vector denoting the label of node $v$. We allow $\mathcal{L}$ to be replaced by other task-specific objective functions, e.g., the negative log-likelihood loss [26].
We summarise the process of Hierarchical Message-passing Graph Neural Networks in Algorithm 1. Given a graph $G$, we first generate the hierarchical structure and combine it with the original graph $G$ to obtain $H = \{G_t \mid t = 1, 2, \dots, T\}$, where $G_1 = G$ (line 2). For each node, including original and generated super nodes, in each NN layer, we perform three primary operations in order: (1) bottom-up propagation (line 6), (2) within-level propagation (line 7), and (3) top-down propagation (lines 9-15). After obtaining the representation vector of each node, enhanced with informative long-range interactions and multi-grained semantics, we train the model with the loss function $\mathcal{L}$ in Eq. 7.
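The super graph construction of phase II can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_super_graph` and its inputs are hypothetical, and the threshold is read as "at least `gamma` crossing edges" so that the default `gamma = 1` links any two super nodes that share an edge; the text's wording "more than γ" could also be read strictly.

```python
from collections import Counter


def build_super_graph(edges, assignment, gamma=1):
    """Build the level-t super graph from level-(t-1) edges.

    edges:      iterable of (u, v) pairs at level t-1 (undirected)
    assignment: dict mapping each level-(t-1) node to its super node
    gamma:      edge-count threshold; a super edge is kept when the number of
                crossing edges is at least gamma (assumed reading)
    Returns (super_edges, inter_level_edges).
    """
    counts = Counter()
    for u, v in edges:
        su, sv = assignment[u], assignment[v]
        if su != sv:  # only edges crossing two different super nodes count
            counts[frozenset((su, sv))] += 1
    super_edges = {tuple(sorted(pair)) for pair, c in counts.items() if c >= gamma}
    # Inter-level edges link every lower-level node to its super node.
    inter_level = {(v, s) for v, s in assignment.items()}
    return super_edges, inter_level
```

Applying the function repeatedly, with each level's super nodes as the next level's nodes, yields the list of super graphs that the text denotes by H.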

Hierarchical Community-aware GNN
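Identifying hierarchical super nodes for the proposed HMGNNs is the most crucial step, as it determines how the information will be propagated within and between levels. We consider hierarchical network communities to construct the hierarchy. The network community has proved helpful for typical network analysis tasks, including node classification [21, 22] and link prediction [33, 34]. Taking algorithm efficiency into account and to avoid introducing additional hyper-parameters, i.e., the number of hierarchy levels, we adopt the well-known Louvain algorithm [23] to build the first implementation of HMGNNs, termed Hierarchical Community-aware Graph Neural Network (HC-GNN). The Louvain algorithm returns a hierarchical structure as described in Sec. 4.1 without the need for a pre-defined number of levels, based on which we can learn node representations involving long-range interactive information and multi-grained semantics. Due to the page limit, we include more details about community detection algorithms in App. A.

Algorithm 1 Hierarchical Message-passing Graph Neural Networks
Input: graph $G = (V, E, X)$
Output: node representations $Z$
1: $h_v^{(0)} \leftarrow x_v$, $\forall v \in G$
2: generate the hierarchy $H = \{G_t \mid t = 1, 2, \dots, T\}$, where $G_1 = G$
3: for $\ell \leftarrow \{1, \dots, L\}$ do
4:   compute $h_v^{(\ell)}$ by flat information aggregation (Eq. 1), $\forall v \in G_1$
5:   for $t \leftarrow \{2, \dots, T\}$ do
6:     $a^{(\ell)} \leftarrow$ bottom-up propagation (Eq. 3), $\forall s \in G_t$
7:     $b^{(\ell)} \leftarrow$ within-level propagation (Eq. 4), $\forall v \in G_t$
8:   end for
9:   for $v \in G$ do
10:     if $\ell < L$ then
11:       $h_v^{(\ell)} \leftarrow$ top-down propagation (Eq. 5)
12:     else
13:       $z_v \leftarrow$ top-down propagation followed by normalisation (Eq. 6)
14:     end if
15:   end for
16: end for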
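Once a hierarchical community detection method has produced nested partitions, each original node needs the list of super nodes it belongs to at every higher level. The sketch below shows one way to derive these sets; the level-wise list-of-dictionaries input format and the function name `ancestor_sets` are assumptions made for illustration, not a description of any particular package's output.

```python
def ancestor_sets(dendrogram):
    """Map every original node to the super nodes it belongs to at each level.

    dendrogram: list of dicts; dendrogram[0] maps original nodes to level-2
                super nodes, dendrogram[1] maps level-2 super nodes to
                level-3 super nodes, and so on.
    Returns a dict node -> list of (level, super_node) pairs, lowest level first.
    """
    ancestors = {}
    for node in dendrogram[0]:
        chain, current = [], node
        for depth, mapping in enumerate(dendrogram):
            current = mapping[current]  # climb one level up the hierarchy
            chain.append((depth + 2, current))
        ancestors[node] = chain
    return ancestors
```

With a hierarchy of T levels, every node ends up with T - 1 ancestors, one per super level, which is the set used during top-down propagation.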

Theoretical Analysis and Model Comparison
Long-range interactive capability. We now theoretically analyse the asymptotic complexity of different GNN models in capturing long-range interactions. Flat GNN models need to stack $O(\mathrm{diam}(G))$ layers to ensure communication between any pair of nodes in $G$. For HMGNNs, let us assume the pooling ratio $\lambda = |V_{t+1}|/|V_t|$. Thus, the potential total number of nodes in HMGNNs over $G$ with $n$ nodes is $\sum_{t=1}^{\infty} n\lambda^t = O(n)$, while the number of possible levels is $\log_{\lambda^{-1}} n = O(\log n)$. That is, the shortest path between any two nodes of $G$ is upper-bounded by $O(\log n)$. Compared to $O(\mathrm{diam}(G))$ with flat GNNs, HMGNNs lead to a significant improvement in the capability to capture long-range interactions.
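The bounds above follow from a geometric series; the short derivation below spells them out. It assumes a constant pooling ratio $0 < \lambda < 1$ at every level and that the coarsest level provides a common ancestor for any two nodes, neither of which is guaranteed for an arbitrary graph.

```latex
\begin{align*}
\sum_{t=1}^{\infty} n\lambda^{t} &= \frac{n\lambda}{1-\lambda} = O(n), \\
n\lambda^{T-1} \ge 1 \;&\Longrightarrow\; T \le 1 + \log_{\lambda^{-1}} n = O(\log n), \\
\text{path length via the hierarchy} &\le 2(T-1) = O(\log n).
\end{align*}
```

For instance, with $\lambda = 1/2$ and $n = 1024$, this gives $T \le 11$ levels and a path of at most 20 hops through the hierarchy, independent of the diameter of $G$.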
Model complexity. For the vanilla flat GNN model, i.e., GCN, the computational complexity of one layer is $O(n^3)$ [24], and the computational complexity of a GCN model that contains $\ell$ layers is $O(\ell n^3)$. For another attention-enhanced flat GNN model, i.e., Graph Attention Network (GAT) [26], in addition to the same convolutional operation as GCN, the masked attention over all nodes requires $O(\ell n^2)$ computational complexity [26]. Thus, overall it takes $O(\ell(n^3 + n^2))$ complexity. For the hierarchical representation model, graph U-Net (g-U-Net) [15], the computational complexity of one hierarchy is $O(2\ell n^3)$, because its unpooling operation introduces another $O(\ell n^3)$ complexity in addition to the convolutional operations of GCN. Thus the complexity of g-U-Net with $T$ levels is $\sum_{t=1}^{T} 2\ell (n\lambda^{t-1})^3 = O(2\ell n^3)$, since the pooled graphs are supposed to have a much smaller number of nodes than $G$. For HC-GNN, take GCN as an example GNN encoder and the Louvain algorithm as an example hierarchical structure construction method, which has optimal $O(n \log c)$ computational complexity [35], where $c$ is the average degree. The top-down propagation allows each node of $G$ to receive $T$ different messages from $T$ levels with different weights; this introduces $O(Tn)$ computational complexity, where $T$ is the number of levels, and we assume $T \ll n$. Altogether, the complexity of HC-GNN combines the encoder, Louvain, and top-down terms above, which is more efficient than GAT and g-U-Net.
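The step from the per-level sum to the stated bound for g-U-Net can be made explicit as follows. This is a sketch that assumes a constant pooling ratio $0 < \lambda < 1$ and writes $\ell$ for the number of layers.

```latex
\begin{align*}
\sum_{t=1}^{T} 2\ell\,(n\lambda^{t-1})^{3}
  = 2\ell n^{3} \sum_{t=1}^{T} \lambda^{3(t-1)}
  \le \frac{2\ell n^{3}}{1-\lambda^{3}}
  = O(2\ell n^{3}).
\end{align*}
```

With $\lambda = 1/2$, for example, the constant factor $1/(1-\lambda^{3})$ equals $8/7$, so the coarser levels add only about 14% to the cost of the first level.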

Experiments
We conduct extensive experiments to answer 6 research questions (RQ):
• RQ1: How does HC-GNN perform vs. state-of-the-art methods for node classification (RQ1-1), community detection (RQ1-2), and link prediction (RQ1-3)?
• RQ2: Can HC-GNN achieve satisfactory performance under the settings of transductive, inductive, and few-shot learning?
• RQ3: How do different levels in the hierarchical structure contribute to the effectiveness of node representations?
• RQ4: How do various hierarchical structure generation methods affect the performance of HC-GNN?
• RQ5: Does HC-GNN remain effective on sparse graphs?
• RQ6: Does HC-GNN work with different GNN encoders?

Evaluation Setup
Datasets. We perform experiments on both synthetic and real-world datasets. For the link prediction task, we adopt 3 datasets:
• Grid [20]. A synthetic 2D grid graph representing a 20 × 20 grid with $|V| = 400$ and no node features.
• Cora [36]. A citation network consisting of 2,708 scientific publications and 5,429 links. A 1,433-dimensional word vector describes each publication as a node feature.
• Power [37]. An electrical grid of the western US with 4,941 nodes and 6,594 edges and no node features.
For node classification, we adopt the following datasets:
• Cora. The above-mentioned Cora dataset, which contains 7 classes of nodes. Each node is labelled with the class it belongs to.
• Citeseer [36]. Each node comes with 3,703-dimensional node features.
• Pubmed [39]. A dataset consisting of 19,717 scientific publications from the PubMed database about diabetes, each classified into one of 3 classes. Each node is described by a TF/IDF weighted word vector from a dictionary of 500 unique words.
• PPI [40]. 24 protein-protein interaction networks; the nodes of each graph come with a 50-dimensional feature vector.
• Protein [41]. 1,113 protein graphs; the nodes of each graph come with a 29-dimensional feature vector. Each node is labelled with a functional role of the protein.
• Ogbn-arxiv [38]. A large-scale citation graph between 169,343 computer science arXiv papers. Each node is an arXiv paper, and each directed edge indicates that one paper cites another one. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of words in its title and abstract. The task is to predict the 40 subject areas of these papers.
For node community detection, we use an email communication dataset:
• Emails [42]. 7 real-world email communication graphs from SNAP with no node features. Each graph has 6 communities, and each node is labelled with the community it belongs to.
The statistics of the datasets are summarised in Table 2, and the datasets are available for download with our published code.
Experimental settings. We evaluate HC-GNN under the settings of transductive and inductive learning. For node classification, we additionally conduct experiments with the few-shot setting.
• Transductive learning. For link prediction, we follow the experimental settings of [20] and use 10% of existing links and an equal number of non-existent links for each of the validation and test sets. The remaining 80% of existing links and an equal number of non-existent links are used as the training set. For node classification, we follow the semi-supervised settings of [24]: if there are enough nodes, for each class, we randomly sample 20 nodes for training, 500 nodes for validation, and 1000 nodes for testing. For the Emails dataset, we follow the supervised learning settings of [27] to randomly select 80% of nodes as the training set, and use the two halves of the remaining nodes as the validation and test sets, respectively. We report the test performance when the best validation performance is achieved.
• Inductive learning. This aims at examining a model's ability to transfer the learned knowledge from existing nodes to future ones that are newly connected to existing nodes in a graph. Hence, we hide the validation and testing graphs during training. We conduct the experiments for inductive learning using the PPI and Protein datasets. We train models on 80% of the graphs to learn an embedding function $f$ and apply it to the remaining 20% of the graphs to generate the representations of newly arriving nodes.
• Few-shot learning. Since the cost of collecting massive labelled datasets is high, a few-shot learning model would be valuable for practical applications. Few-shot learning can also be considered as an indicator of the robustness of a deep learning model. We perform few-shot node classification, in which only 5 samples of each class are used for training. The sampling strategies for the testing and validation sets follow those in transductive learning.
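The per-class sampling used in the semi-supervised and few-shot node classification settings can be sketched as below. The function name `sample_split`, its parameters, and the seeded shuffling are illustrative assumptions; the only difference between the two settings in this sketch is the value of `per_class` (20 versus 5).

```python
import random
from collections import defaultdict


def sample_split(labels, per_class, n_val, n_test, seed=0):
    """Sample a node-classification split.

    labels:    dict node -> class label
    per_class: number of training nodes drawn from each class
    n_val, n_test: sizes of the validation and test sets drawn from the rest
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, label in labels.items():
        by_class[label].append(node)

    train = []
    for label in sorted(by_class):
        nodes = by_class[label][:]
        rng.shuffle(nodes)
        train.extend(nodes[:per_class])  # at most per_class nodes of this class

    chosen = set(train)
    rest = [v for v in labels if v not in chosen]
    rng.shuffle(rest)
    return train, rest[:n_val], rest[n_val:n_val + n_test]
```

For example, `sample_split(labels, per_class=5, n_val=500, n_test=1000)` would mirror the few-shot protocol on a graph with enough nodes, while `per_class=20` would mirror the semi-supervised one.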
Evaluation metrics. We adopt AUC to measure the performance of link prediction. For node classification, we use micro- and macro-averaged F1 scores and accuracy. The NMI score is used to evaluate community detection.
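For reference, the two F1 averages differ only in where the averaging happens. The following is a minimal sketch for single-label multi-class predictions; the helper name `f1_scores` is hypothetical, and in practice a library implementation would normally be used.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro
```

In the single-label case micro-F1 coincides with accuracy, whereas macro-F1 gives every class the same weight and is therefore more sensitive to errors on small classes.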
Competing methods. To validate the effectiveness of HC-GNN, we compare it with 10 competing methods, which include 6 flat message-passing GNN models (GCN [24], GraphSAGE [25], GAT [26], GIN [7], P-GNNs [20], GCNII [13]), 3 hierarchical GNN models (HARP [31], g-U-Net [15], GXN [19]), and another state-of-the-art model (GraphRNA [27]). For more details about the competing methods, refer to App. B.
Reproducibility. For a fair comparison, all methods adopt the same representation dimension (d = 32), learning rate (1e−3), Adam optimiser, and number of iterations (200) with early stopping (50). Regarding the number of neural network layers, for GCNII we report the best-performing configuration among {8, 16, 32, 64, 128} layers; for the other models, we report the best-performing configuration between 2 and 4 layers. For all models with a hierarchical structure (including g-U-Net and HC-GNN), we use GCN as the default GNN encoder for a fair comparison. Note that for the strong competitor P-GNNs, since its representation dimension is related to the number of nodes in a graph, we add a linear regression layer at the end of P-GNNs for node classification tasks to ensure that its end-to-end structure is the same as that of the other models [27]. For HC-GNN, the number of HC-GNN layers is varied and denoted as 1L, 2L or 3L. In Sec. 5.3, HC-GNN adopts the number of layers leading to the best performance for model analysis, i.e., 2L for the Cora dataset and 1L for the Citeseer and Pubmed datasets. For Louvain community detection, we use the implementation of an existing package (footnote 2), which does not require any hyper-parameters. We implement all models mentioned in this paper with PyTorch and PyTorch Geometric. More details can be found in our code (footnote 3). The experiments are repeated 10 times, and average results are reported. If a dataset provides no node features, we use unique one-hot identifiers as node features to differentiate nodes; otherwise, we use the original node features. Experiments were conducted on machines with an NVIDIA Tesla V100 GPU.
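The model-selection protocol (at most 200 iterations, early stopping, and reporting the test score at the best validation score) can be sketched as below. This is an illustration under assumptions rather than the released training loop: "early stop (50)" is read as a patience of 50 iterations without validation improvement, and `step_fn` is a hypothetical callback that trains for one iteration and returns the current validation and test scores.

```python
def select_by_validation(step_fn, max_iters=200, patience=50):
    """Run up to max_iters iterations and keep the test score at the best validation score.

    step_fn(i) is assumed to perform one training iteration and return
    (val_score, test_score), where higher is better.
    """
    best_val, test_at_best, best_iter = float("-inf"), None, -1
    for i in range(max_iters):
        val, test = step_fn(i)
        if val > best_val:
            best_val, test_at_best, best_iter = val, test, i
        elif i - best_iter >= patience:
            break  # no validation improvement for `patience` iterations
    return best_val, test_at_best, best_iter
```

Selecting on the validation score only, and never on the test score, keeps the reported test numbers free of selection bias.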

Experimental Results
Table 3 Results in Micro-F1 and Macro-F1 for the transductive semi-supervised node classification task. Results in Acc for node classification on Ogbn-arxiv follow the default settings of the OGB dataset [38], and results in NMI are for community detection (i.e., on the Emails data in the last column). Standard deviation errors are given. ‡ indicates results from the OGB leaderboard [38]. OOM: out-of-memory. 1L: model with a 1-layer GNN encoder for within-level propagation.
(a) the hierarchical structure allows the model to capture informative long-range interactions of graphs, i.e., propagating messages from and to distant nodes in the graph; and (b) the meso- and macro-level semantics reflected by the hierarchy are encoded through bottom-up, within-level, and top-down propagations. On the other hand, P-GNNs, HARP, and GraphRNA perform worse in semi-supervised node classification. A possible reason is that they need more training samples, such as the 80% of existing nodes used as the training set in their papers [20,27], whereas we have only 20 nodes per class for training in the semi-supervised setting.
Inductive node classification (RQ1-1&RQ2). The results are reported in Table 4 (see footnote 4). HC-GNN still shows some performance improvement over existing GNN models. However, compared to the results in transductive learning, the improvement is less significant and is inconsistent across different numbers of HC-GNN layers. A possible reason is that different graphs may have different hierarchical community structures. Nevertheless, the results lead to one observation: the effect of transferring hierarchical semantics between graphs for inductive node classification is somewhat limited. Therefore, exploring an improved model that can adaptively exploit the hierarchical structure of different graphs for different tasks would be interesting. We discuss this further in Sec. 6 as a concluding remark.
Few-shot node classification (RQ1-1&RQ2). Table 5 shows that HC-GNN performs better in few-shot learning than all competing methods across 3 datasets. Such results indicate that hierarchical message passing is able to transfer supervised information through inter- and intra-level propagations. In addition, the hierarchical message-passing pipeline further enlarges the influence range of supervision information from a small number of training samples. With effective and efficient pathways to broadcast information, HC-GNN appears quite promising for few-shot learning.
Community detection (RQ1-2). The community detection results on the Emails dataset are also shown in Table 3. HC-GNN again outperforms all competing methods. We believe this is because the communities identified by Louvain are further exploited by learning their hierarchical interactions in HC-GNN. In other words, HC-GNN is able to reinforce the intra- and inter-community effect and encode it into node representations.
Link prediction (RQ1-3). Here, we motivate our idea by considering pairwise relation prediction between nodes. Suppose a pair of nodes u, v is labelled with label y, and our goal is to predict y for unseen pairs. From the perspective of representation learning, we can solve the problem by learning an embedding function f that computes the node representation z_v, where the objective is to maximise the likelihood of the distribution p(y | z_u, z_v). The results in Table 6 indicate that HC-GNN leads to competitive performance compared to all competing methods, with up to 11.7% AUC improvement, demonstrating its effectiveness on link prediction tasks. When node features are accessible (i.e., Cora-Feat and Power-Feat), all models perform relatively well, because node features provide meaningful information for predicting pairwise relations; g-U-Net performs slightly better on the Cora-Feat dataset. Another interesting perspective is the models' performance without contextual node features (e.g., Grid, Cora-NoFeat and Power-NoFeat). Notably, the HC-GNN variants show a clear advantage on these three datasets. We argue that when only topological information is available, the hierarchical semantics introduced by HC-GNN help find missing links.
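To make the pairwise formulation concrete, the sketch below scores a node pair from its two representations and computes AUC from the scores of existing and non-existent links. The decoder is an assumption made for illustration (a sigmoid of the inner product of z_u and z_v); the text above only specifies that p(y | z_u, z_v) is modelled. The function names are hypothetical.

```python
import math


def link_probability(z_u, z_v):
    """Assumed decoder: sigmoid of the inner product of two node representations."""
    score = sum(a * b for a, b in zip(z_u, z_v))
    return 1.0 / (1.0 + math.exp(-score))


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive pair outranks a negative pair.

    Ties count as one half. The cost is quadratic in the number of pairs,
    which is adequate for an illustration.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Because AUC depends only on the ranking of scores, any monotone decoder applied to the same inner products yields the same value.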

Contribution of different levels (RQ3).
Since HC-GNN relies heavily on the generated hierarchical structure, we examine how different levels in the hierarchy contribute to the prediction. We report the transductive semi-supervised node classification performance while varying the number of levels (from 1T to 4T). GCN is also included for comparison because it considers no hierarchy, i.e., it performs only within-level propagation in the original graph. The results are shown in Fig. 3(a), in which 1T and 2T indicate that only the first hierarchy level and the first 2 hierarchy levels are adopted, respectively. HC-GNN achieves better results when more levels are used for hierarchy construction, whereas the flat message passing of GCN performs worse. Such results provide strong evidence that GNNs can benefit significantly from the hierarchical message-passing mechanism. In addition, more hierarchical semantics can be encoded if more levels are adopted.
Influence of hierarchy generation approaches (RQ4). HC-GNN implements the proposed Hierarchical Message-passing Graph Neural Networks based on the Louvain community detection algorithm; this variant is termed HC-GNN-Louvain in this paragraph. We aim to validate (A) whether the community information truly benefits the classification tasks, and (B) how different approaches to generating the hierarchical structure affect the performance.
To answer (A), we construct a random hierarchical structure to obtain a randomised HC-GNN, termed HC-GNN-Random, in which Louvain detects hierarchical communities and nodes are then randomly swapped among the same-level communities. In other words, the hierarchy structure is maintained, but community memberships are perturbed. The results on semi-supervised node classification are shown in Fig. 3(b). HC-GNN-Random performs worse than GCN on Cora and Pubmed, and much worse than HC-GNN-Louvain. This implies that hierarchical communities generated from the graph topology genuinely have a positive effect on information propagation. Meanwhile, HC-GNN-Random surprisingly achieves better performance than GCN on Citeseer. We argue this is because HC-GNN-Random can still spread supervision information through the hierarchy structure, leading to the occasional improvement. To answer (B), we use Girvan Newman [43] to produce the hierarchical structure in the same way as described in Sec. 4.1, and obtain a model named HC-GNN-Girvan Newman.
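The membership perturbation behind HC-GNN-Random can be sketched as follows. The exact swapping procedure is not specified above, so this sketch assumes a full random permutation of the node-to-community assignment at one level, which preserves the number and sizes of communities while breaking their alignment with the topology. The function name `perturb_memberships` is hypothetical.

```python
import random


def perturb_memberships(membership, seed=0):
    """Randomly reassign nodes to same-level communities, keeping community sizes.

    membership maps each node (or super node) at one level to its community id.
    Assumption: a full permutation of the labels is used as the perturbation.
    """
    rng = random.Random(seed)
    nodes = sorted(membership)
    labels = [membership[n] for n in nodes]
    rng.shuffle(labels)  # permuting labels preserves each community's size
    return dict(zip(nodes, labels))
```

Applying the same function to the super nodes of each higher level would perturb the whole hierarchy while leaving its shape (number of levels and community sizes) intact.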
The results are shown in Fig. 3(b). Although HC-GNN-Girvan Newman is not as effective as HC-GNN-Louvain, it still outperforms GCN. Such a result indicates that the approach used to generate the hierarchical structure influences the capability of HC-GNN. While HC-GNN-Louvain leads to promising performance, one can search for a suitable hierarchical community detection method to perform better on different tasks.
Influence of graph sparsity (RQ5). Since community detection algorithms are sensitive to the sparsity of the graph [44], we study how HC-GNN performs on sparser graphs in the task of semi-supervised node classification. We consider two kinds of sparsity: one is graph sparsity, obtained by randomly removing a percentage of edges (10% to 50%) from all edges in the graph; the other is node sparsity, obtained by randomly drawing a portion of the edges incident to every node in the graph for removal. The random removal of edges can be interpreted as users hiding some of their connections due to privacy concerns. The results for Cora and Citeseer are presented in Fig. 4. HC-GNN significantly outperforms the competing methods under both graph sparsity and node sparsity for the different edge-removal percentages. Such results suggest that, even though community detection is affected by graph sparsity, HC-GNN's performance does not degrade below that of the competing models.
Ablation study of different primary GNN encoders (RQ6). We adopted GCN as the default primary GNN encoder in the model presentation (Sec. 4) and the previous experiments. Here, we present additional experimental results obtained by equipping HC-GNN with more advanced GNN encoders in Table 7. The table shows that advanced GNN encoders can still benefit from the multi-grained semantics of HC-GNN. For instance, GCNII can stack many layers to capture long-range information; however, it still follows a flat message-passing mechanism and hence naturally ignores the multi-grained semantics. HC-GNN mitigates this problem and achieves better performance.
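The two sparsification schemes of the sparsity study (RQ5) can be sketched as below. This is an illustration under assumptions: the graph is taken to be simple and undirected with integer node ids, the function names are hypothetical, and in the per-node variant each node selects a fraction of its own incident edges, so a node may additionally lose edges selected by its neighbours.

```python
import random
from collections import defaultdict


def remove_edges_globally(edges, ratio, seed=0):
    """Graph sparsity: drop a fraction `ratio` of all edges uniformly at random."""
    rng = random.Random(seed)
    edges = list(edges)
    n_remove = round(ratio * len(edges))
    removed = set(rng.sample(range(len(edges)), n_remove))
    return [e for i, e in enumerate(edges) if i not in removed]


def remove_edges_per_node(edges, ratio, seed=0):
    """Node sparsity: every node selects a fraction `ratio` of its incident edges to drop.

    An edge is removed if either endpoint selects it, so each node loses
    at least floor(ratio * degree) of its edges.
    """
    rng = random.Random(seed)
    edges = [(min(u, v), max(u, v)) for u, v in edges]
    incident = defaultdict(list)
    for e in edges:
        incident[e[0]].append(e)
        incident[e[1]].append(e)
    dropped = set()
    for node in sorted(incident):
        k = int(ratio * len(incident[node]))
        dropped.update(rng.sample(incident[node], k))
    return [e for e in edges if e not in dropped]
```

The global variant mostly thins out high-degree regions in absolute terms, while the per-node variant affects every neighbourhood proportionally, which is why the two settings are worth reporting separately.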

Conclusion and Future Work
This paper has presented a novel Hierarchical Message-passing Graph Neural Networks (HMGNNs) framework, which deals with two critical deficiencies of the flat message-passing mechanism in existing GNN models, i.e., the limited ability to aggregate information over long ranges and the inability to encode meso- and macro-level graph semantics. Following this idea, we further presented the first implementation, Hierarchical Community-aware Graph Neural Network (HC-GNN), with the assistance of a hierarchical community detection algorithm. The theoretical analysis confirms HC-GNN's significant ability to capture long-range interactions without introducing heavy computational complexity. Extensive experiments conducted on 9 datasets show that HC-GNN can consistently outperform state-of-the-art GNN models in 3 tasks, including node classification, link prediction, and community detection, under transductive, inductive, and few-shot learning settings. Furthermore, the proposed hierarchical message-passing GNN provides model flexibility. For instance, it readily allows different choices and customised designs of the hierarchical structure, and it integrates well with advanced flat GNN encoders to obtain better results. That said, HMGNNs could easily be applied as a general practical framework to boost downstream tasks with an arbitrary hierarchical structure and encoder. The proposed hierarchical message-passing GNNs provide a good starting point for exploiting graph hierarchy with GNN models. In the future, we aim to incorporate the learning of the hierarchical structure into the model optimisation of GNNs, such that a better hierarchy can be searched on the fly. Moreover, it would also be interesting to extend our framework to heterogeneous networks.

Fig. 1 Elaboration of the proposed hierarchical message passing: (a) a collaboration network, (b) an illustration of hierarchical message-passing mechanism based on (a) and (c), and (c) an example of the identified hierarchical structure.

Fig. 2 (a) The architecture of Hierarchical Message-passing Graph Neural Networks: we first generate a hierarchical structure, in which each level is formed as a super graph, use the level t graph to update the nodes of the level t + 1 graph (bottom-up propagation), apply the typical neighbour aggregation on each level's graph (within-level propagation), use the generated node representations from levels 2 ≤ t ≤ T to update the node representations at level 1 (top-down propagation), and optimise the model via a task-specific loss. (b) NN-1: bottom-up propagation. (c) NN-2: within-level propagation. (d) NN-3: top-down propagation.

Fig. 4 Results on semi-supervised node classification in graphs by varying the percentage of removed edges.

Table 1
Summary of main notations.
Notation         Description
[...]            the set of nodes and edges on G, respectively
A                the adjacency matrix of G
X ∈ R^{n×π}      the matrix of node features
d                the pre-defined representation dimension
H ∈ R^{n×d}      the hidden node representation matrix
h_v ∈ R^d        the hidden node representation of node v
Z ∈ R^{n×d}      the final node representation matrix
z_v ∈ R^d        the final node representation of node v
L                the number of layers of the within-level propagation GNN encoder
T                the number of hierarchy levels
G_t              the super graph at level t
s^t_n            the n-th super node of G_t at level t
H                the set of constructed super graphs
N(v)             the set of neighbour nodes of node v
γ                a hyper-parameter used to construct super graph G_t
λ                the pooling ratio

Table 2
Summary of dataset statistics. LP: Link Prediction, NC: Node Classification, CD: Community Detection. N.A. means a dataset does not contain node features or node labels.

Table 4
Micro-F1 results for inductive node classification. Standard deviation errors are given. 1L: model with a 1-layer GNN encoder for within-level propagation.

Table 5
Micro-F1 results for few-shot node classification. Standard deviation errors are given. 1L: model with a 1-layer GNN encoder for within-level propagation.

Table 6
Results in AUC for link prediction. Standard deviation errors are given. 1L: model with a 1-layer GNN encoder for within-level propagation.

Table 7
Comparison of HC-GNN with different primary GNN encoders (within-level propagation), following the transductive node classification settings. Results are reported in Micro-F1.