1 Introduction

Graphs are a ubiquitous data structure that models objects and their relationships within complex systems, such as social networks, biological networks, and recommendation systems Wu et al. (2021). Learning node representations from a large graph has proved to be a useful approach for a wide variety of network analysis tasks, including link prediction Zhang and Chen (2018), node classification Zitnik et al. (2018), and community detection Chen et al. (2019).

Graph Neural Networks (GNNs) are currently one of the most promising paradigms to learn and exploit node representations due to their ability to effectively encode node features and graph topology in transductive, inductive, and few-shot settings Zhang et al. (2020). Many existing GNN models follow a similar flat message-passing principle where information is iteratively passed between adjacent nodes along observed edges. Such a paradigm is able to incorporate the local information surrounding each node Gilmer et al. (2017). However, it has been proven to suffer from several drawbacks (Xu et al. 2019; Min et al. 2020; Li et al. 2020).

Among these deficiencies of flat message-passing GNNs, the limited ability to aggregate information over long ranges has attracted significant attention Li et al. (2018), since most graph-related tasks require interactions between nodes that are not directly connected Alon and Yahav (2021). That said, flat message-passing GNNs struggle to capture dependencies between distant node pairs. Inspired by the outstanding effectiveness of very deep neural network models in the computer vision and natural language processing domains LeCun et al. (2015), a natural solution is to stack many GNN layers to directly increase the receptive field of each node. Consequently, deeper models have been proposed by simplifying the aggregation design of GNNs, accompanied by well-designed normalisation units or specific gradient descent methods (Chen et al. 2020; Gu et al. 2020). Nevertheless, Alon and Yahav have theoretically shown that flat GNNs become a bottleneck when aggregating messages across a long path, leading to severe over-squashing issues Alon and Yahav (2021).

Fig. 1 Elaboration of the proposed hierarchical message passing: a a collaboration network, b an illustration of the hierarchical message-passing mechanism based on (a) and (c), and c an example of the identified hierarchical structure

On the other hand, we further argue that another crucial deficiency of flat message-passing GNNs is that they only aggregate messages across the observed topological structure. The hierarchical semantics behind the graph structure provides useful information and should be incorporated into the learning of node representations. Take the collaboration network in Fig. 1a as an example: author nodes highlighted in light yellow come from the same institutes, and nodes filled with different colours indicate authors in different research areas. In order to generate the node representation of a given author, existing GNNs mainly capture co-author-level information based on the explicit graph structure. However, information hidden at the meso and macro levels is neglected. In the example of Fig. 1, meso-level information refers to authors belonging to the same institute and their connections to adjacent institutes; macro-level information refers to authors in the same research area and their relationships with related research areas. Neither meso- nor macro-level knowledge can be directly modelled through flat message passing via observed edges.

In this paper, we investigate the idea of a hierarchical message-passing mechanism to enhance the information aggregation pipeline of GNNs. The ultimate goal is to make the node representation learning process aware of both long-range interactive information and implicit multi-resolution semantics within the graph.

We note that a few graph pooling approaches have recently attempted to use the idea of hierarchical structure (Gao and Ji 2019; Ying et al. 2018; Huang et al. 2019; Ranjan et al. 2020; Li et al. 2020). g-U-Net Gao and Ji (2019) and GXN Li et al. (2020) employ a bottom-up and top-down pooling operation; however, they do not allow long-range message passing. DiffPool Ying et al. (2018), AttPool Huang et al. (2019) and ASAP Ranjan et al. (2020) target graph classification tasks rather than enabling node representations to capture long-range dependencies and the multi-grained semantics of a graph. Moreover, P-GNNs You et al. (2019) create a different information aggregation mechanism that utilises sampled anchor nodes to inject topological position information into the learned node representations. While P-GNNs can capture global information, the hierarchical semantics mentioned above is still overlooked, and global message passing is not realised. Besides, the anchor-set sampling process is time-consuming for large graphs, and it cannot work well under the inductive setting.

Specifically, we present a novel framework, Hierarchical Message-passing Graph Neural Networks (HMGNNs), elaborated in Fig. 1. In detail, HMGNNs can be organised into the following four phases.

  (i) Hierarchical structure generation. To overcome long-distance obstacles in the process of GNN message passing, we propose to use a hierarchical structure to reduce the size of graph \(\mathcal {G}\) gradually, where nodes at each level t are integrated into super nodes (\(s^{t+1}_{1}, \dots , s^{t+1}_{n}\)) at level \(t\!+\!1\).

  (ii) t-level super graph construction. In order to allow message passing among the generated same-level super nodes, we construct a super graph \(\mathcal {G}_t\) based on the connections between nodes at its lower level \(t\!-\!1\).

  (iii) Hierarchical message propagation. With the generated hierarchical structure for a given graph, we develop three propagation manners: bottom-up, within-level, and top-down.

  (iv) Model learning. Last, we leverage task-specific loss functions and a gradient descent procedure to train the model.

Designing a feasible hierarchical structure is crucial for HMGNNs, as the hierarchical structure determines how messages can be passed through different levels and what kind of meso- and macro-level information is encoded in node representations. In this paper, we consider (but are not restricted to) network communities. As a natural graph property, communities have proved very useful for many graph mining tasks (Wang et al. 2014, 2017), and many community detection methods can generate hierarchical community structures. Here, we propose an implementation of the framework, Hierarchical Community-aware Graph Neural Network (HC-GNN). HC-GNN exploits a well-known hierarchical community detection method, i.e., the Louvain method Blondel et al. (2008), to build the hierarchical structure, which is then used for the hierarchical message-passing mechanism.

The theoretical analysis illustrates HC-GNN’s remarkable capacity for capturing long-range information without introducing heavy additional computational complexity. Extensive empirical experiments are conducted on 9 graph datasets to reveal the performance of HC-GNN on a variety of tasks, i.e., link prediction, node classification, and community detection, under transductive, inductive and few-shot settings. The results show that HC-GNN consistently outperforms a set of state-of-the-art approaches for link prediction and node classification. In the few-shot learning setting, where only 5 samples of each label are used to train the model, HC-GNN achieves a significant performance improvement, up to \(16.4\%\). We also deliver a few empirical insights: (a) the lowest level contributes most to node representations; (b) how the hierarchical structure is generated has a significant impact on the quality of node representations; (c) HC-GNN maintains an outstanding performance for graphs with different levels of sparsity perturbation; (d) HC-GNN possesses significant flexibility in incorporating different GNN encoders, which means HC-GNN can achieve superior performance with advanced flat GNN encoders.

Contributions The contribution of this paper is five-fold:

  1. We propose a novel Hierarchical Message-passing Graph Neural Networks framework, which allows nodes to conveniently capture informative long-range interactions and encode multi-grained semantics hidden behind the given graph.

  2. We present the first implementation of our framework, namely HC-GNN, by detecting and utilising hierarchical community structures for message passing.

  3. Theoretical analysis demonstrates the efficiency and capacity of HC-GNN in capturing long-range interactions in graphs.

  4. Experimental results show that HC-GNN significantly outperforms competing GNN methods on several prediction tasks under transductive, inductive, and few-shot settings.

  5. Further empirical analysis is conducted to derive insights into the impact of the hierarchical structure and graph sparsity on HC-GNN and to confirm its flexibility in incorporating different GNN encoders.

The rest of this paper is organised as follows. We begin by briefly reviewing additional related work in Sect. 2. Then, in Sect. 3, we introduce the preliminaries of this study and state the research problem. In Sect. 4, we introduce our proposed framework, Hierarchical Message-passing Graph Neural Networks, and its first implementation, HC-GNN. Experimental results and empirical analysis are shown in Sect. 5. Finally, we conclude the paper and discuss future work in Sect. 6.

2 Related work

Flat message-passing GNNs These models perform graph convolution, directly aggregating node features from neighbours in the given graph, and stack multiple GNN layers to capture long-range node dependencies (Kipf and Welling 2017; Hamilton et al. 2017; Velickovic et al. 2018; Xu et al. 2019). However, they have been observed not to benefit from more than a few layers, and recent studies have theoretically characterised this problem as over-smoothing (Li et al. 2018; Alon and Yahav 2021), i.e., node representations become indistinguishable when the number of GNN layers increases. On the other hand, GraphRNA (Huang et al. 2019) presents graph recurrent networks to capture interactions between far-away nodes. Still, it cannot be applied to inductive learning settings because it relies on attributed random walks, and the recurrent aggregations introduce high computation costs. P-GNNs (You et al. 2019) incorporate a novel global information aggregation mechanism based on the distance of a given target node to each anchor set. However, P-GNNs sacrifice the ability of existing GNNs on inductive node-wise tasks. As shown in their paper, they only support pairwise node classification tasks, i.e., comparing whether two nodes have the same class label instead of predicting the class label of each individual node. Additionally, the anchor-set sampling operation brings a high computational cost for large graphs. Recently, deeper flat GNNs have been proposed by simplifying the aggregation design of GNNs, accompanied by well-designed normalisation units (Chen et al. 2020) or specific gradient descent methods (Gu et al. 2020). Nevertheless, Alon and Yahav (2021) have theoretically shown that flat GNNs are susceptible to becoming a bottleneck when aggregating messages across a long path, leading to severe over-squashing issues. Moreover, we will theoretically discuss the advantages of our method compared with flat GNNs in Sect. 4.3, in terms of long-range interactive capability and complexity.

Hierarchical representation GNNs In recent years, some studies have generalised the pooling mechanism from computer vision Ronneberger et al. (2015) to GNNs for graph representation learning (Ying et al. 2018; Huang et al. 2019; Gao and Ji 2019; Ranjan et al. 2020; Li et al. 2020, 2020; Rampásek and Wolf 2021). However, most of them, such as DiffPool Ying et al. (2018), AttPool Huang et al. (2019) and ASAP Ranjan et al. (2020), are designed for graph classification tasks rather than learning node representations that capture long-range dependencies and multi-resolution semantics; thus they cannot be directly applied to node-level tasks. g-U-Net Gao and Ji (2019) defines a similarity-based pooling operator to construct the hierarchical structure, and GXN Li et al. (2020) designs an infomax pooling operator; both implement bottom-up and top-down operations. Despite the success of g-U-Net and GXN in producing graph-level representations, they cannot model multi-grained semantics or realise long-range message passing. HARP Chen et al. (2018) and LouvainNE Bhowmick et al. (2020) are two unsupervised network representation approaches that adopt a hierarchical structure, but they do not support the supervised training paradigm to optimise for specific tasks, and they cannot be applied in inductive settings.

More recently, HGNet Rampásek and Wolf (2021) leverages multi-resolution representations of a graph to facilitate capturing long-range interactions. Below, we discuss the main differences between HGNet and HC-GNN. HC-GNN designs efficient and effective bottom-up and top-down propagation mechanisms to realise hierarchical message passing, rather than directly applying pooling and relational GCN, respectively. We further provide a theoretical analysis to demonstrate the efficiency and capacity of HC-GNN; such an analysis has not been performed for HGNet. We also provide a much more careful and comprehensive set of experimental studies to validate the effectiveness of HC-GNN, including comparing learning settings on node classification (transductive, inductive, and few-shot), comparing to more recent competing flat GNN methods, comparing to state-of-the-art hierarchical GNN models, evaluating on the link prediction task, and in-depth analysis of graph sparsity and primary GNN encoders (Sect. 5). Last but not least, in addition to capturing long-range interactions, we further discuss in depth the benefits and usefulness of the hidden hierarchical structure in a graph.

Table 8 summarises the critical advantages of the proposed HC-GNN and compares it with a number of recently published state-of-the-art methods. We are the first to present hierarchical message passing to efficiently model long-range informative interactions and multi-grained semantics. In addition, our HC-GNN can utilise community structures and be applied for transductive, inductive and few-shot inference.

3 Problem statement

An attributed graph with n nodes can be represented as \(\mathcal {G}=(\mathcal {V}, \mathcal {E}, \textbf{X})\), where \(\mathcal {V}=\{v_{1}, v_{2},\dots , v_{n}\}\) is the node set, \(\mathcal {E} \subseteq \mathcal {V} \times \mathcal {V}\) denotes the set of edges, and \(\textbf{X}=\{\textbf{x}_{1}, \textbf{x}_{2}, \dots , \textbf{x}_{n}\}\in \mathbb {R}^{n \times \pi }\) is the feature matrix, in which each vector \(\textbf{x}_i\in \textbf{X}\) is the feature vector associated with node \(v_{i}\), and \(\pi \) is the dimension of the input feature vector of each node. For subsequent discussion, we summarise \(\mathcal {V}\) and \(\mathcal {E}\) into an adjacency matrix \(\textbf{A} \in \{0, 1\}^{n \times n}\).

Problem definition Given a graph \(\mathcal {G}\) and a pre-defined representation dimension d, the goal is to learn a mapping function \(f:\! \mathcal {G} \rightarrow \textbf{Z}\), where \(\textbf{Z} \in \mathbb {R}^{n \times d}\) and each row \(\textbf{z}_{i}\in \textbf{Z}\) corresponds to the node \(v_i\)’s representation. The effectiveness of f is evaluated by applying \(\textbf{Z}\) to different tasks, including node classification, link prediction, and community detection. Table 1 lists the mathematical notation used in the paper.

Table 1 Summary of main notations

Flat node representation learning Prior to introducing the hierarchical message-passing mechanism, we first give a general review of existing Graph Neural Networks (GNNs) with flat message passing. Let \(\hat{\textbf{A}} = (\hat{\textbf{A}}_{uv})_{u,v \in \mathcal {V}}\), where \(\hat{\textbf{A}}_{uv}\) is a normalised value of \(\textbf{A}_{uv}\). Thus, we can formally define the \(\ell \)-th layer of a flat GNN as:

$$\begin{aligned} \begin{aligned} \textbf{m}^{(\ell )}_a&= \textsc {Aggregate}^{N}(\{\hat{\textbf{A}}_{uv}, \, \textbf{h}^{(\ell -1)}_u \, \vert \, u \in \mathcal {N}(v) \}), \\ \textbf{m}^{(\ell )}_v&= \textsc {Aggregate}^{I}(\{\hat{\textbf{A}}_{uv} \, \vert \, u \in \mathcal {N}(v) \}) \, \textbf{h}^{(\ell -1)}_v, \\ \textbf{h}^{(\ell )}_v&= \textsc {Combine}(\textbf{m}^{(\ell )}_a, \textbf{m}^{(\ell )}_v) \end{aligned} \end{aligned}$$
(1)

where \(\textsc {Aggregate}^{N}(\cdot )\) and \(\textsc {Aggregate}^{I}(\cdot )\) are two possibly differentiable parameterised functions. \(\textbf{m}^{(\ell )}_a\) is the message aggregated from node v’s neighbourhood nodes (\(\mathcal {N}(v)\)) with their structural coefficients, and \(\textbf{m}^{(\ell )}_v\) is the residual message from node v after performing an adjustment operation to account for structural effects from its neighbourhood nodes. Then, \(\textbf{h}^{(\ell )}_v\) is the representation vector of node v learned by combining \(\textbf{m}^{(\ell )}_a\) and \(\textbf{m}^{(\ell )}_v\), termed \(\textsc {Combine}(\cdot )\), at the \(\ell \)-th iteration/layer. Note that we initialise \(\textbf{h}^{(0)}_v = \textbf{x}_v\), and the final learned representation vector after L iterations/layers is \(\textbf{z}_v = \textbf{h}^{(L)}_v\).

Take the classic Graph Convolutional Network (GCN) Kipf and Welling (2017) as an example, which applies two normalised mean aggregations to aggregate feature vectors from node v’s neighbourhood nodes \(\mathcal {N}(v)\) and combine them with its own:

$$\begin{aligned} \textbf{h}^{(\ell )}_v = \textrm{ReLU}\left( \sum \limits _{u \in \mathcal {N}(v)}\frac{\textbf{W}^{(\ell )} \textbf{h}^{(\ell -1)}_u}{\sqrt{\vert \mathcal {N}(u) \vert \vert \mathcal {N}(v) \vert }} + \frac{\textbf{W}^{(\ell )}\textbf{h}^{(\ell -1)}_v}{\sqrt{\vert \mathcal {N}(v) \vert \vert \mathcal {N}(v) \vert }}\right) \end{aligned}$$
(2)

where \(\sqrt{\vert \mathcal {N}(u) \vert \vert \mathcal {N}(v) \vert }\) is a constant normalisation coefficient for the edge \(\mathcal {E}_{uv}\), which is calculated from the normalised adjacency matrix \(\textbf{D}^{-1/2}\textbf{A}\textbf{D}^{-1/2}\). \(\textbf{D}\) is the diagonal node degree matrix of \(\textbf{A}\). \(\textbf{W}^{(\ell )} \in \mathbb {R}^{n \times d}\) is a trainable weight matrix of layer \(\ell \). From Eqs. 1 and 2, we can see that existing GNNs iteratively pass messages between adjacent nodes along observed edges, which leads to two significant limitations: (a) limited ability for long-range information aggregation, since they need to stack k layers to capture interactions within k steps of each node; (b) they cannot encode meso- and macro-level graph semantics.
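To make Eqs. 1–2 concrete, the following is a minimal dense-matrix sketch of a single GCN layer in PyTorch; the function name, the dense adjacency interface, and the self-loop handling are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import torch

def gcn_layer(adj: torch.Tensor, h: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """One GCN layer in the spirit of Eq. 2: symmetric normalisation over A with self-loops."""
    n = adj.size(0)
    a_hat = adj + torch.eye(n)                  # self-loops realise the +h_v term of Eq. 2
    deg = a_hat.sum(dim=1)                      # degrees |N(v)| (including the self-loop)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt    # D^{-1/2} (A + I) D^{-1/2}
    return torch.relu(a_norm @ h @ weight)      # aggregate neighbours, transform, activate

# toy usage: 4 nodes on a path graph with 3-dimensional features
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
h0 = torch.randn(4, 3)                          # initial features h^{(0)} = x
w1 = torch.randn(3, 8, requires_grad=True)      # trainable weight matrix W^{(1)}
h1 = gcn_layer(adj, h0, w1)                     # (4, 8) representations after one layer
```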

4 Proposed approach

We propose a framework, Hierarchical Message-passing Graph Neural Networks (HMGNNs), whose core idea is to use a hierarchical message-passing structure to enable node representations to receive long-range messages and multi-grained semantics from different levels. Fig. 2 provides an overview of the proposed framework, consisting of four components. First, we create a hierarchical structure to coarsen the input graph \(\mathcal {G}\) gradually. Nodes at each level t of the hierarchy are grouped into different super nodes (\(s^{t}_{1}, \dots , s^{t}_{n}\)). Second, we further organise the super nodes generated from level t into a super graph \(\mathcal {G}_{t+1}\) at level \(t\!+\!1\), based on the connections between nodes at level t, in order to enable message passing that encodes the interactions between the generated super nodes. Third, we develop three different propagation schemes to propagate messages among nodes within the same level and across different levels. At last, after obtaining node representations, we use a task-specific loss function and a gradient descent procedure to train the model.

Fig. 2 a The architecture of Hierarchical Message-passing Graph Neural Networks: we first generate a hierarchical structure, in which each level is formed as a super graph, use the level t graph to update nodes of the level \(t+1\) graph (bottom-up propagation), apply the typical neighbour aggregation on each level’s graph (within-level propagation), use the generated node representations from levels \(2\le t\le T\) to update node representations at level 1 (top-down propagation), and optimise the model via a task-specific loss. b NN-1: bottom-up propagation. c NN-2: within-level propagation. d NN-3: top-down propagation

4.1 Hierarchical message-passing GNNs

I. Hierarchical structure generation Nodes \(\mathcal {V}\) of a graph \(\mathcal {G}\) can be naturally organised by super node structures of T different levels, i.e., \(\{\mathcal {V}_1, \mathcal {V}_2, \dots , \mathcal {V}_T \}\), in which densely inter-connected nodes of \(\mathcal {V}_{t\!-\!1}\) (\(2\le t\le T\)) are grouped into a super node of \(\mathcal {V}_{t}\). For example, in Fig. 1a, the author set \(\mathcal {V}_1=\{v_{1}, v_{2}, \dots , v_{17}\}\) can be grouped into different super nodes \(\mathcal {V}_2=\{s_{1}, s_{2}, \dots , s_{9}\}\) based on their institutes. Institutes can be further grouped into higher-level super nodes \(\mathcal {V}_3=\{r_{1}, r_{2}, \dots , r_{4}\}\) according to research areas. Meanwhile, there are relationships between nodes at different levels, as indicated by the dashed lines in Fig. 1c. Hence, we can generate a hierarchical structure to depict the inter- and intra-level relationships among authors, institutes, and research areas. We will discuss how to implement the hierarchical structure generation in Sect. 4.2.

II. t-Level super graph construction The level t super graph \(\mathcal {G}_{t}\) is constructed based on the level \(t\!-\!1\) graph \(\mathcal {G}_{t\!-\!1}\) (\(t\ge 2\)), where \(\mathcal {G}_1\) denotes the original graph \(\mathcal {G}\). Given the nodes at level \(t\!-\!1\), i.e., \(\mathcal {V}_{t\!-\!1}=\{s^{t\!-\!1}_{1}, \dots , s^{t\!-\!1}_{m}\}\), densely inter-connected nodes of \(\mathcal {V}_{t\!-\!1}\) are grouped into a super node of \(\mathcal {V}_{t}\) according to Sect. 4.1-I. We further create an edge between two super nodes \(s^{t}_{i}\) and \(s^{t}_{j}\) if there exist more than \(\gamma \) edges in \(\mathcal {G}_{t\!-\!1}\) connecting elements in \(s^{t}_{i}\) and elements in \(s^{t}_{j}\), where \(\gamma \) is a hyper-parameter and \(\gamma =1\) by default. In this way, we obtain an alternative representation of the hierarchical structure as a list of (super) graphs \(\mathcal {H}=\{\mathcal {G}_{1}, \dots , \mathcal {G}_{T}\}\), where \(\mathcal {G}_{1} = \mathcal {G}\). Moreover, inter-level edges are created to depict the relationships between (super) nodes at different levels t and \(t\!-\!1\), if a level \(t\!-\!1\) node has a corresponding super node at level t; see for example Fig. 1c. We initialise the feature vectors of generated super nodes as zero vectors with the same length as the original node feature vectors \(\textbf{x}_i\). Taking the collaboration network in Fig. 1 as an example, at the micro-level (level 1), we have authors and their co-authorship relations; at the meso-level (level 2), we organise authors according to their affiliations and establish relations between institutes; at the macro-level (level 3), institutes are further grouped according to their research areas, and we have the relations among the research areas. In addition, inter-level links are also created to depict the relationships between authors and institutes and between institutes and research areas.
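As an illustration of this construction, the sketch below builds \(\mathcal {G}_{t}\) from \(\mathcal {G}_{t\!-\!1}\) and a node-to-super-node assignment using networkx; the function name and the dictionary-based assignment format are assumptions made for exposition, not the released implementation.

```python
import networkx as nx

def build_super_graph(g_prev: nx.Graph, assignment: dict, gamma: int = 1) -> nx.Graph:
    """Construct the level-t super graph G_t from the level-(t-1) graph G_{t-1}.

    g_prev     : graph G_{t-1}
    assignment : maps each node of G_{t-1} to the id of its super node at level t
    gamma      : two super nodes are linked if more than `gamma` edges of G_{t-1}
                 run between their members (gamma = 1 by default, as in the text)
    """
    g_super = nx.Graph()
    g_super.add_nodes_from(set(assignment.values()))

    # count the edges of G_{t-1} running between every pair of super nodes
    cross_edges = {}
    for u, v in g_prev.edges():
        su, sv = assignment[u], assignment[v]
        if su != sv:
            key = tuple(sorted((su, sv)))
            cross_edges[key] = cross_edges.get(key, 0) + 1

    for (su, sv), count in cross_edges.items():
        if count > gamma:
            g_super.add_edge(su, sv)
    return g_super

# toy usage: two triangles joined by two edges collapse into two linked super nodes
g = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3), (1, 4)])
part = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
print(build_super_graph(g, part).edges())       # one super edge A-B, since 2 > gamma = 1
```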

III. Hierarchical message propagation The hierarchical message-passing mechanism works as a supplementary process to enhance node representations with long-range interactions and multi-grained semantics. Thus it does not change the flat node representation learning process described in Sect. 3, ensuring that local information is well maintained; we adopt the classic GCN, as described in Eq. 2, as our default flat GNN encoder throughout the paper. Specifically, each \(\ell \)-th layer of the hierarchical message-passing mechanism consists of 3 steps (a minimal code sketch of one such layer is given after the three steps below).

  1. Bottom-up propagation. After obtaining node representations (\(\textbf{h}^{(\ell )}_{s^{t\!-\!1}}\)) of \(\mathcal {G}_{t\!-\!1}\) with the \(\ell \)-th flat information aggregation, we perform bottom-up propagation, i.e., NN-1 in Fig. 2b, using node representations in \(\mathcal {G}_{t\!-\!1}\) to update node representations in \(\mathcal {G}_{t}\) (\(t\ge 2\)) in the hierarchy \(\mathcal {H}\), as follows:

    $$\begin{aligned} \textbf{a}^{(\ell )}_{s^{t}_i} = \frac{1}{\vert {s^{t}_i}\vert +1} \left( \sum _{s^{t\!-\!1} \in s^{t}_i} \textbf{h}^{(\ell )}_{s^{t\!-\!1}} + \textbf{h}^{(\ell \!-\!1)}_{s^{t}_i}\right) \end{aligned}$$
    (3)

    where \(s^{t}_i\) is a super node in \(\mathcal {G}_{t}\), and \(s^{t\!-\!1}\) is a node in \(\mathcal {G}_{t\!-\!1}\) that belongs to \(s^{t}_i\) in \(\mathcal {G}_{t}\). \(\textbf{h}^{(\ell \!-\!1)}_{s^{t}_i}\) is the representation of \(s^{t}_i\) generated by layer \(\ell \!-\! 1\) in graph \(\mathcal {G}_{t}\), \(\vert {s^{t}_i} \vert \) is the number of level \(t\!-\!1\) nodes belonging to super node \(s^{t}_i\), and \(\textbf{a}^{(\ell )}_{s^{t}_i}\) is the updated representation of \(s^{t}_i\).

  2. Within-level propagation. We explore typical flat GNN encoders (Kipf and Welling 2017; Hamilton et al. 2017; Velickovic et al. 2018; Xu et al. 2019; Chen et al. 2020) to propagate information within each level’s graph \(\{\mathcal {G}_{1}, \mathcal {G}_{2},\dots , \mathcal {G}_{T} \}\), i.e., NN-2 in Fig. 2c. The aim is to aggregate neighbours’ information and update within-level node representations. Specifically, the information aggregation at level t is depicted as follows:

    $$\begin{aligned} \begin{aligned} \textbf{m}^{(\ell )}_a&= \textsc {Aggregate}^{N}(\{\hat{\textbf{A}}^{t}_{uv}, \, \textbf{a}^{(\ell )}_u \, \vert \, u \in \mathcal {N}^{t}(v) \}), \\ \textbf{m}^{(\ell )}_v&= \textsc {Aggregate}^{I}(\{\hat{\textbf{A}}^{t}_{uv} \, \vert \, u \in \mathcal {N}^{t}(v) \}) \, \textbf{a}^{(\ell )}_{v}, \\ \textbf{b}^{(\ell )}_v&= \textsc {Combine}(\textbf{m}^{(\ell )}_a, \textbf{m}^{(\ell )}_v) \\ \end{aligned} \end{aligned}$$
    (4)

    where \(\textbf{a}^{(\ell )}_{u}\) is the node representation of u after bottom-up propagation at the \(\ell \)-th layer, \(\mathcal {N}^{t}(v)\) is the set of nodes adjacent to v at level t, and \(\textbf{b}^{(\ell )}_{v}\) is the aggregated node representation of v based on local neighbourhood information. Note that we adopt the classic GCN, as described in Eq. 2, as our default GNN encoder throughout the paper. We will discuss the possibility of incorporating other advanced GNN encoders in Sect. 5.3.

  3. Top-down propagation. The top-down propagation is illustrated by NN-3 in Fig. 2d. We use node representations in \(\{\mathcal {G}_{2}, \dots , \mathcal {G}_{T}\}\) to update the representations of the original nodes in \(\mathcal {G}\). The importance of messages at different levels can differ across tasks. Hence, we adopt the attention mechanism Velickovic et al. (2018) to adaptively learn the contribution weights of different levels during top-down integration, given by:

    $$\begin{aligned} \textbf{h}^{(\ell )}_{v} = \textrm{ReLU}(\textbf{W} \cdot \textsc {MEAN} \{ \alpha _{uv} \textbf{b}^{(\ell )}_{u} \}), \forall u \in \mathcal {C}(v) \cup {\{v\}} \end{aligned}$$
    (5)

    where \(\alpha _{uv}\) is a trainable normalised attention coefficient between node v and super node u (or v itself), \(\textsc {MEAN}\) is an element-wise mean operation, \(\mathcal {C}(v)\) denotes the set of different-level super nodes from levels \(\{2, \dots , T\}\) that node v belongs to (\(\vert \mathcal {C}(v) \vert =T-1\)), and \(\textrm{ReLU}\) is the activation function. \(\textbf{H}^{(\ell )}\) is the generated node representation matrix of layer \(\ell \), with \(\textbf{h}^{(\ell )}_{v} \in \textbf{H}^{(\ell )}\). We generate the output node representations at the last layer (L) via:

    $$\begin{aligned} \textbf{z}_{v} = \sigma (\textbf{W} \cdot \textsc {MEAN} \{ \alpha _{uv} \textbf{b}^{(L)}_{u} \}), \forall u \in \mathcal {C}(v) \cup {\{v\}} \end{aligned}$$
    (6)

    where \(\sigma \) is the Euclidean normalisation function that reshapes values into [0, 1]. \(\textbf{Z} \in \mathbb {R}^{n \times d}\) is the final generated node representation matrix, with each row vector \(\textbf{z}_{v} \in \textbf{Z}\).
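The following is a minimal single-layer sketch of the three steps above for a two-level hierarchy (the original graph \(\mathcal {G}_{1}\) plus one super graph \(\mathcal {G}_{2}\)); the dense-tensor interface, the dot-product attention standing in for the GAT-style coefficients of Eq. 5, and the omission of deeper levels are simplifying assumptions made for exposition only.

```python
import torch
import torch.nn.functional as F

def norm_adj(adj: torch.Tensor) -> torch.Tensor:
    """Symmetrically normalised adjacency with self-loops, D^{-1/2}(A+I)D^{-1/2}."""
    a_hat = adj + torch.eye(adj.size(0))
    d_inv_sqrt = torch.diag(a_hat.sum(1).pow(-0.5))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt

def hierarchical_layer(adj1, adj2, member, h1, h2, w1, w2, w_out):
    """One hierarchical layer for a two-level hierarchy.

    member[v, s] = 1 if original node v belongs to super node s.
    """
    # (1) bottom-up (Eq. 3): each super node averages its members plus its own previous state
    sizes = member.sum(dim=0, keepdim=True).t()             # |s| for every super node, shape (n2, 1)
    a2 = (member.t() @ h1 + h2) / (sizes + 1.0)

    # (2) within-level (Eq. 4): plain GCN-style aggregation on every level's graph
    b1 = torch.relu(norm_adj(adj1) @ h1 @ w1)
    b2 = torch.relu(norm_adj(adj2) @ a2 @ w2)

    # (3) top-down (Eq. 5): each original node mixes its own message with its super node's,
    #     weighted by attention; a dot-product score stands in for the GAT-style coefficients
    up = member @ b2                                        # message from the super node of v
    scores = torch.stack([(b1 * b1).sum(-1), (b1 * up).sum(-1)], dim=-1)
    alpha = F.softmax(scores, dim=-1)                       # normalised attention coefficients
    mixed = alpha[:, :1] * b1 + alpha[:, 1:] * up
    h1_new = torch.relu(mixed @ w_out)
    return h1_new, b2                                       # updated level-1 and level-2 states

# toy usage: 4 original nodes grouped into 2 connected super nodes, d = 8
adj1 = torch.tensor([[0., 1., 0., 0.], [1., 0., 1., 0.], [0., 1., 0., 1.], [0., 0., 1., 0.]])
adj2 = torch.tensor([[0., 1.], [1., 0.]])
member = torch.tensor([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
d = 8
h1, h2 = torch.randn(4, d), torch.zeros(2, d)               # super-node features start as zeros
w1, w2, w_out = (torch.randn(d, d, requires_grad=True) for _ in range(3))
h1_new, h2_new = hierarchical_layer(adj1, adj2, member, h1, h2, w1, w2, w_out)
```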

IV. Model learning The proposed HMGNNs can be trained in unsupervised, semi-supervised, or supervised settings. Here, we only discuss the supervised setting used for node classification in our experiments. We define the loss function based on cross-entropy, as follows:

$$\begin{aligned} \mathcal {L} = -\sum _{v\in \mathcal {V}} \textbf{y}^{\top }_{v} \log (\textrm{Softmax} (\textbf{z}_{v})) \end{aligned}$$
(7)

where \(\textbf{y}_{v}\) is a one-hot vector denoting the label of node v. We allow \(\mathcal {L}\) to be customised for other task-specific objective functions, e.g., the negative log-likelihood loss Velickovic et al. (2018).

Algorithm 1 Hierarchical Message-passing Graph Neural Networks

We summarise the process of Hierarchical Message-passing Graph Neural Networks in Algorithm 1. Given a graph \(\mathcal {G}\), we first generate the hierarchical structure and combine it with the original graph \(\mathcal {G}\) to obtain \(\mathcal {H} = \{ \mathcal {G}_{t} \, \vert \, t=1, 2, \dots , T \}\), where \(\mathcal {G}_{1}=\mathcal {G}\) (line 2). For each node, including the original and generated super nodes, in each NN layer, we perform three primary operations in order: (1) bottom-up propagation (line 6), (2) within-level propagation (line 7), and (3) top-down propagation (lines \(9\!-\!15\)). After obtaining the representation vector of each node, enhanced with informative long-range interactions and multi-grained semantics, we train the model with the loss function \(\mathcal {L}\) in Eq. 7.
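For completeness, a minimal training loop matching Eq. 7 and the optimiser settings reported in Sect. 5 could look as follows; `model` is assumed to be any module returning the node representations \(\textbf{Z}\) (e.g., a stack of hierarchical layers as sketched above) and `classifier` a linear head mapping \(\textbf{z}_v\) to class scores — both are placeholders rather than the exact released code.

```python
import torch
import torch.nn.functional as F

def train(model, classifier, inputs, labels, train_mask, epochs: int = 200, lr: float = 1e-3):
    """Minimal supervised training loop with the cross-entropy loss of Eq. 7."""
    params = list(model.parameters()) + list(classifier.parameters())
    optimiser = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        optimiser.zero_grad()
        z = model(*inputs)                               # final node representations Z
        logits = classifier(z)                           # class scores fed to the softmax in Eq. 7
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])
        loss.backward()
        optimiser.step()
    return model, classifier
```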

4.2 Hierarchical community-aware GNN

Identifying hierarchical super nodes for the proposed HMGNNs is the most crucial step, as it determines how information will be propagated within and between levels. We consider hierarchical network communities to construct the hierarchy. Network communities have proved helpful for assisting typical network analysis tasks, including node classification (Wang et al. 2014, 2017) and link prediction (Sun and Han 2012; Rossetti et al. 2015). Taking algorithm efficiency into account and avoiding additional hyper-parameters, i.e., the number of hierarchy levels, we adopt the well-known Louvain algorithm Blondel et al. (2008) to build the first implementation of HMGNNs, termed Hierarchical Community-aware Graph Neural Network (HC-GNN). The Louvain algorithm returns a hierarchical structure as described in Sect. 4.1 without the need for a pre-defined number of hierarchy levels, based on which we can learn node representations involving long-range interactive information and multi-grained semantics. Due to the page limit, we include more details about community detection algorithms in App. A.
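A minimal sketch of the hierarchy extraction, assuming the python-louvain package (imported as `community`) and networkx are used; the actual released implementation may post-process the dendrogram differently.

```python
import networkx as nx
import community as community_louvain   # the python-louvain package

def louvain_hierarchy(g: nx.Graph):
    """Return one node-to-community assignment per hierarchy level (finest level first)."""
    dendrogram = community_louvain.generate_dendrogram(g)
    # partition_at_level maps every original node of g to its community id at the given level
    return [community_louvain.partition_at_level(dendrogram, level)
            for level in range(len(dendrogram))]

# toy usage: each level's assignment feeds the super-graph construction of Sect. 4.1-II
g = nx.karate_club_graph()
for t, assignment in enumerate(louvain_hierarchy(g), start=2):
    print(f"level {t}: {len(set(assignment.values()))} super nodes")
```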

4.3 Theoretical analysis and model comparison

Long-range interactive capability We now theoretically analyse the asymptotic cost for different GNN models to capture long-range interactions. Flat GNN models need to stack \(\mathcal {O}(\textrm{diam}(\mathcal {G}))\) layers to ensure communication between any pair of nodes in \(\mathcal {G}\). For HMGNNs, let us assume the pooling ratio \(\lambda =\vert \mathcal {V}_{t\!+\!1}\vert / \vert \mathcal {V}_{t}\vert \). Then the total number of nodes in HMGNNs over \(\mathcal {G}\) with n nodes is at most \(\sum _{t=1}^{\infty } n \lambda ^{t}=\mathcal {O}(n)\) (a geometric series with ratio \(\lambda <1\)), while the number of possible levels is \(\log _{\lambda ^{-1}} n=\mathcal {O}(\log n)\). That said, the shortest path between any two nodes of \(\mathcal {G}\) through the hierarchy is upper-bounded by \(\mathcal {O}(\log n)\). Compared to the \(\mathcal {O}(\textrm{diam}(\mathcal {G}))\) layers required by flat GNNs, HMGNNs thus lead to a significant improvement in the capability of capturing long-range interactions.

Model complexity For the vanilla flat GNN model, i.e., GCN, the computational complexity of one layer is \(\mathcal {O}(n^{3})\) Kipf and Welling (2017), and the computational complexity of a GCN model containing \(\ell \) layers is \(\mathcal {O}(\ell n^{3})\). For the attention-enhanced flat GNN model, i.e., Graph Attention Network (GAT) Velickovic et al. (2018), besides the same convolutional operation as GCN, the additional masked attention over all nodes requires \(\mathcal {O}(\ell n^{2})\) computational complexity Velickovic et al. (2018); thus, overall it takes \(\mathcal {O}(\ell (n^{3} + n^{2}))\) complexity. For the hierarchical representation model graph U-Net (g-U-Net) Gao and Ji (2019), the computational complexity of one hierarchy is \(\mathcal {O}(2 \ell n^{3})\), because its unpooling operation introduces another \(\mathcal {O}(\ell n^{3})\) complexity in addition to the convolutional operations of GCN. Thus the complexity of g-U-Net with T levels is \(\sum _{t=1}^{T}2\ell (n\lambda ^{t\!-\!1})^{3} = \mathcal {O}(2\ell n^{3})\), since the pooled graphs are supposed to have a much smaller number of nodes than \(\mathcal {G}\). For HC-GNN, take GCN as an example GNN encoder and the Louvain algorithm as an example hierarchical structure construction method, which has an optimal \(\mathcal {O}(n \log c)\) computational complexity Traag (2015), where c is the average degree. The top-down propagation allows each node of \(\mathcal {G}\) to receive T different messages from T levels with different weights, which introduces \(\mathcal {O}(Tn)\) computational complexity, where T is the number of levels and we assume \(T \ll n\). Altogether, the complexity of HC-GNN is \(\sum _{t=1}^{T} \ell (n \lambda ^{t\!-\!1})^{3} + \mathcal {O}(n \log c + Tn) = \mathcal {O}(\ell n^{3} + n \log c + Tn)\), which is more efficient than GAT and g-U-Net.
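Collecting the expressions above (with \(\ell \) layers, n nodes, average degree c, and T hierarchy levels), the complexities compared in this section are:

$$\begin{aligned} \text {GCN}&: \mathcal {O}(\ell n^{3}), \\ \text {GAT}&: \mathcal {O}(\ell (n^{3} + n^{2})), \\ \text {g-U-Net}&: \mathcal {O}(2\ell n^{3}), \\ \text {HC-GNN}&: \mathcal {O}(\ell n^{3} + n \log c + Tn). \end{aligned}$$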

5 Experiments

We conduct extensive experiments to answer 6 research questions (RQ):

  • RQ1: How does HC-GNN perform vs. state-of-the-art methods for node classification (RQ1-1), community detection (RQ1-2), and link prediction (RQ1-3)?

  • RQ2: Can HC-GNN lead to satisfying performance under transductive, inductive, and few-shot learning settings?

  • RQ3: How do different levels in the hierarchical structure contribute to the effectiveness of node representations?

  • RQ4: How do various hierarchical structure generation methods affect the performance of HC-GNN?

  • RQ5: Can HC-GNN survive low graph sparsity?

  • RQ6: Can HC-GNN work with different GNN encoders?

Table 2 Summary of dataset statistics. LP: Link Prediction, NC: Node Classification, CD: Community Detection, N.A. means a dataset does not contain node features or node labels

5.1 Evaluation setup

Datasets We perform experiments on both synthetic and real-world datasets. For the link prediction task, we adopt 3 datasets:

  • Grid You et al. (2019). A synthetic 2D grid graph representing a \(20 \times 20\) grid with \(\vert \mathcal {V} \vert =400\) and no node features.

  • Cora Sen et al. (2008). A citation network consisting of 2,708 scientific publications and 5,429 links. A 1,433-dimensional word vector describes each publication as its node feature.

  • Power Watts and Strogatz (1998). An electrical grid of the western US with 4,941 nodes, 6,594 edges, and no node features.

For node classification, we use 6 datasets: Cora, Citeseer Kipf and Welling (2017), Pubmed Kipf and Welling (2017), and the large-scale benchmark dataset Ogbn-arxiv Hu et al. (2020) for transductive settings, and 2 protein interaction networks, Protein and PPI Ying et al. (2018), for inductive settings.

  • Cora. The same above-mentioned Cora dataset, whose nodes are divided into 7 classes. Each node is labelled with the class it belongs to.

  • Citeseer Sen et al. (2008). Each node comes with 3,703-dimensional node features.

  • Pubmed Namata et al. (2012). A dataset consisting of 19,717 scientific publications from the PubMed database about diabetes, classified into one of 3 classes. Each node is described by a TF/IDF-weighted word vector from a dictionary of 500 unique words.

  • PPI Zitnik and Leskovec (2017). 24 protein-protein interaction networks; nodes of each graph come with a 50-dimensional feature vector.

  • Protein Borgwardt et al. (2005). 1,113 protein graphs; nodes of each graph come with a 29-dimensional feature vector. Each node is labelled with a functional role of the protein.

  • Ogbn-arxiv Hu et al. (2020). A large-scale citation graph between 169,343 computer science arXiv papers. Each node is an arXiv paper, and each directed edge indicates that one paper cites another. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of the words in its title and abstract. The task is to predict which of the 40 subject areas each paper belongs to.

For node community detection, we use an email communication dataset:

  • Emails Leskovec and Krevl (2014). 7 real-world email communication graphs from SNAP with no node features. Each graph has 6 communities, and each node is labelled with the community it belongs to.

The statistics of the datasets are summarised in Table 2, and the datasets are available for download with our published code.

Experimental settings We evaluate HC-GNN under the settings of transductive and inductive learning. For node classification, we additionally conduct experiments with the few-shot setting.

  • Transductive learning For link prediction, we follow the experimental settings of You et al. (2019), using \(10\%\) of the existing links and an equal number of non-existent links as the validation and test sets. The remaining \(80\%\) of existing links and a dual number of non-existent links are used as the training set. For node classification, we follow the semi-supervised settings of Kipf and Welling (2017): if there are enough nodes, for each class, we randomly sample 20 nodes for training, 500 nodes for validation, and 1000 nodes for testing. For the Emails dataset, we follow the supervised learning settings of Huang et al. (2019), randomly selecting \(80\%\) of nodes as the training set and using the two halves of the remaining nodes as the validation and test sets, respectively. We report the test performance when the best validation performance is achieved.

  • Inductive learning This aims at examining a model’s ability to transfer knowledge learned from existing nodes to new nodes that become connected to existing nodes in a graph. Hence, we hide the validation and testing graphs during training. We conduct the inductive learning experiments on the PPI and Protein datasets. We train models on \(80\%\) of the graphs to learn an embedding function f and apply it to the remaining \(20\%\) of the graphs to generate the representations of new-coming nodes.

  • Few-shot learning Since the cost of collecting massive labelled datasets is high, a model that works well in the few-shot regime is valuable for practical applications. Few-shot learning can also be considered an indicator of the robustness of a deep learning model. We perform few-shot node classification, in which only 5 samples of each class are used for training. The sampling strategies for the testing and validation sets follow those in transductive learning (a minimal sampling sketch is given after this list).
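To illustrate the sampling protocol described above, a per-class split could be drawn as follows; the seeding, tie-breaking, and handling of classes with too few nodes are left out here and follow our released code instead.

```python
import torch

def per_class_split(labels: torch.Tensor, per_class: int = 20, num_val: int = 500, num_test: int = 1000):
    """Sample `per_class` training nodes per class (20 transductive, 5 few-shot),
    then draw validation and test nodes from the remainder."""
    train_parts = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        idx = idx[torch.randperm(idx.numel())]
        train_parts.append(idx[:per_class])
    train_idx = torch.cat(train_parts)
    rest = torch.tensor([i for i in range(labels.numel()) if i not in set(train_idx.tolist())])
    rest = rest[torch.randperm(rest.numel())]
    return train_idx, rest[:num_val], rest[num_val:num_val + num_test]

# toy usage: few-shot split with 5 labelled nodes per class
labels = torch.randint(0, 7, (2708,))              # e.g., a Cora-sized label vector with 7 classes
train_idx, val_idx, test_idx = per_class_split(labels, per_class=5)
```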

Evaluation metrics We adopt AUC to measure the performance of link prediction. For node classification, we use micro- and macro-averaged F1 scores and accuracy. The NMI score is utilised for community detection evaluation.

Competing methods To validate the effectiveness of HC-GNN, we compare it with 10 competing methods, including 6 flat message-passing GNN models (GCN Kipf and Welling (2017), GraphSAGE Hamilton et al. (2017), GAT Velickovic et al. (2018), GIN Xu et al. (2019), P-GNNs You et al. (2019), GCNII Chen et al. (2020)), 3 hierarchical GNN models (HARP Chen et al. (2018), g-U-Net Gao and Ji (2019), GXN Li et al. (2020)), and another state-of-the-art model (GraphRNA Huang et al. (2019)). For more details about the competing methods, refer to App. B.

Reproducibility For a fair comparison, all methods adopt the same representation dimension (\(d=32\)), learning rate (\(=1e\!-\!3\)), Adam optimiser, and number of iterations (\(=200\)) with early stopping (50). In terms of the number of neural network layers, for GCNII we report the best performance among \(\{8, 16, 32, 64, 128\}\) layers; for other models, we report the best performance among \(2\!-\!4\) layers. For all models with a hierarchical structure (including g-U-Net and HC-GNN), we use GCN as the default GNN encoder for a fair comparison. Note that for the strong competitor P-GNNs, since its representation dimension is related to the number of nodes in a graph, we add a linear regression layer at the end of P-GNNs for node classification tasks to ensure its end-to-end structure is the same as other models Huang et al. (2019). For HC-GNN, the number of HC-GNN layers is varied and denoted as 1L, 2L or 3L. In Sect. 5.3, HC-GNN adopts the number of layers leading to the best performance for model analysis, i.e., 2L for the Cora dataset and 1L for the Citeseer and Pubmed datasets. For Louvain community detection, we use an existing package implementation, which does not require any hyper-parameters. We use PyTorch Geometric to implement all models mentioned in this paper; more details can be found in our code. The experiments are repeated 10 times, and average results are reported. Note that we use node features with unique one-hot identifiers to differentiate nodes if a dataset provides no node features, and use the original node features if they are available. Experiments were conducted on GPU (NVIDIA Tesla V100) machines.

5.2 Experimental results

Table 3 Results in Micro-F1 and Macro-F1 for transductive semi-supervised node classification task

Transductive node classification (RQ1-1 & RQ2) We present the results of transductive node classification in Table 3. We can see that HC-GNN consistently outperforms all of the competing methods on the 5 datasets, and even the shallow HC-GNN model with only one layer can lead to better results. We think the outstanding performance of HC-GNN results from two aspects: (a) the hierarchical structure allows the model to capture informative long-range interactions in graphs, i.e., propagating messages from and to distant nodes in the graph; and (b) the meso- and macro-level semantics reflected by the hierarchy is encoded through bottom-up, within-level, and top-down propagations. On the other hand, P-GNNs, HARP, and GraphRNA perform worse in semi-supervised node classification. The possible reason is that they need more training samples, such as using \(80\%\) of existing nodes as the training set, as described in their papers (You et al. 2019; Huang et al. 2019), whereas we have only 20 nodes per class for training in the semi-supervised setting.

Table 4 Micro-F1 results for inductive node classification. Standard deviation errors are given

Inductive node classification (RQ1-1 & RQ2) The results are reported in Table 4. We can find that HC-GNN is still able to show some performance improvement over existing GNN models. However, the improvement is less significant and less consistent across different numbers of HC-GNN layers compared to the results in transductive learning. The possible reason is that different graphs may have different hierarchical community structures. Nevertheless, the results lead to one observation: the effect of transferring hierarchical semantics between graphs for inductive node classification is somewhat limited. Therefore, exploring an ameliorated model that can adaptively exploit the hierarchical structures of different graphs for different tasks would be interesting. We further discuss it in Sect. 6 as one concluding remark.

Table 5 Micro-F1 results for few-shot node classification

Few-shot node classification (RQ1-1 & RQ2) Table 5 shows that HC-GNN delivers better performance than all competing methods across the 3 datasets in few-shot learning. Such results indicate that hierarchical message passing is able to transfer supervision information through inter- and intra-level propagations. In addition, the hierarchical message-passing pipeline further enlarges the influence range of supervision information from a small number of training samples. With effective and efficient pathways to broadcast information, HC-GNN proves to be quite promising in few-shot learning.

Community detection (RQ1-2) The community detection results conducted on the Emails dataset are also shown in Table 3. It can be seen that HC-GNN again outperforms all competing methods. We believe this is because the communities identified by Louvain are further exploited by learning their hierarchical interactions in HC-GNN. In other words, HC-GNN is able to reinforce the intra- and inter-community effect and encode it into node representations.

Table 6 Results in AUC for link prediction. Standard deviation errors are given

Link prediction (RQ1-3) Here, we motivate our idea by considering pairwise relation prediction between nodes. Suppose a pair of nodes u, v is labelled with label y; our goal is to predict y for unseen pairs. From the perspective of representation learning, we can solve the problem by learning an embedding function f that computes the node representation \(\textbf{z}_{v}\), where the objective is to maximise the likelihood of the distribution \(p(y \vert \textbf{z}_{u}, \textbf{z}_{v})\). The results in Table 6 indicate that HC-GNN leads to competitive performance compared to all competing methods, with up to \(11.7\%\) AUC improvement, demonstrating its effectiveness on link prediction tasks. When node features are accessible (i.e., Cora-Feat and Power-Feat), all models perform relatively well, and g-U-Net has slightly better performance on the Cora-Feat dataset, because node features provide meaningful information to predict pairwise relations. Another interesting perspective is to investigate the models’ performance without contextual node features (e.g., Grid, Cora-NoFeat and Power-NoFeat). It is surprising that HC-GNN variants show great superiority on these three datasets. We argue that when only topological information is available, the hierarchical semantics introduced by HC-GNN helps find missing links.
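Concretely, \(p(y \vert \textbf{z}_{u}, \textbf{z}_{v})\) can be scored with, for example, an inner-product decoder over the learned representations; this particular decoder is an illustrative assumption and not fixed by the framework.

```python
import torch

def link_score(z: torch.Tensor, pairs: torch.Tensor) -> torch.Tensor:
    """Link probability for each (u, v) pair via a sigmoid over the inner product z_u . z_v."""
    z_u, z_v = z[pairs[:, 0]], z[pairs[:, 1]]
    return torch.sigmoid((z_u * z_v).sum(dim=-1))

# toy usage: score two candidate pairs with 32-dimensional node representations
z = torch.randn(10, 32)
pairs = torch.tensor([[0, 1], [2, 7]])
print(link_score(z, pairs))        # two values in (0, 1), ranked against negatives for AUC
```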

5.3 Empirical model analysis

Fig. 3 Results in Micro-F1 for semi-supervised node classification using HC-GNN by varying: a the number of hierarchy levels adopted for message passing, and b the approaches to generate the hierarchical structure. 2T means the model with the first 2 hierarchy levels

Contribution of different levels (RQ3) Since HC-GNN relies heavily on the generated hierarchical structure, we aim to examine how different levels in the hierarchy contribute to the prediction. We report the transductive semi-supervised node classification performance by varying the number of levels (from 1T to 4T). GCN is also selected for comparison because it considers no hierarchy, i.e., only within-level propagation in the original graph. The results are shown in Fig. 3a, in which 1T and 2T indicate that only the first hierarchy level and the first 2 hierarchy levels are adopted, respectively. We can find that HC-GNN using more levels for hierarchy construction leads to better results, while the flat message passing of GCN cannot work well. Such results provide strong evidence that GNNs can significantly benefit from the hierarchical message-passing mechanism. In addition, more hierarchical semantics can be encoded if more levels are adopted.

Influence of hierarchy generation approaches (RQ4) HC-GNN implements the proposed Hierarchical Message-passing Graph Neural Networks based on the Louvain community detection algorithm, termed HC-GNN-Louvain in this paragraph. We aim to validate (A) whether the community information truly benefits the classification tasks, and (B) how different approaches to generating the hierarchical structure affect the performance. To answer (A), we construct a random hierarchical structure to generate a randomised HC-GNN, termed HC-GNN-Random, in which Louvain detects hierarchical communities and nodes are then randomly swapped among the same-level communities. In other words, the hierarchy structure is maintained, but community memberships are perturbed (a minimal sketch of this perturbation is given below). The results on semi-supervised node classification are exhibited in Fig. 3b. We can see that HC-GNN-Random works worse than GCN on Cora and Pubmed, and much worse than HC-GNN-Louvain. It implies that hierarchical communities generated from the graph topology genuinely have a positive effect on information propagation. Meanwhile, we surprisingly find that HC-GNN-Random achieves better performance than GCN on Citeseer. We argue this is because HC-GNN-Random can still spread supervision information through the hierarchy structure, leading to the occasional improvement. To answer (B), we utilise the Girvan–Newman algorithm Girvan and Newman (2002) to produce the hierarchical structure in the same way as described in Sect. 4.1, and obtain a model named HC-GNN-Girvan Newman. The results are shown in Fig. 3b. Although HC-GNN-Girvan Newman is not as effective as HC-GNN-Louvain, it still outperforms GCN. Such a result indicates that the approach used to generate the hierarchical structure influences the capability of HC-GNN. While HC-GNN-Louvain leads to promising performance, one can search for a proper hierarchical community detection method to perform better on different tasks.
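In sketch form, the HC-GNN-Random perturbation can be obtained by shuffling the node-to-community memberships at each level while keeping the number and sizes of communities fixed; the dictionary-based assignment format follows the earlier sketches and is an assumption for exposition.

```python
import random

def randomise_assignment(assignment: dict, seed: int = 0) -> dict:
    """Shuffle which node carries which community label, keeping community sizes intact
    (the perturbation behind HC-GNN-Random)."""
    rng = random.Random(seed)
    nodes = list(assignment.keys())
    labels = list(assignment.values())
    rng.shuffle(labels)
    return dict(zip(nodes, labels))
```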

Fig. 4 Results on semi-supervised node classification in graphs by varying the percentage of removed edges

Influence of graph sparsity (RQ5) Since community detection algorithms are sensitive to the sparsity of the graph Nadakuditi and Newman (2012), we study how HC-GNN performs on graphs with low sparsity values in the task of semi-supervised node classification. We consider two kinds of sparsity: one is graph sparsity, obtained by randomly removing a percentage of edges from all edges in the graph, i.e., \(10\%-50\%\); the other is node sparsity, obtained by randomly removing a portion of the edges incident to every node in the graph. The random removal of edges can be viewed as users hiding partial connections due to privacy concerns. The results for Cora and Citeseer are presented in Fig. 4. HC-GNN significantly outperforms the competing methods under both graph sparsity and node sparsity with different edge-removal percentages. Such results show that even though community detection is affected by sparse graphs, this does not damage HC-GNN’s performance to the point of making it worse than the other competing models.

Table 7 Comparison of HC-GNN with different primary GNN encoders (within-level propagation), following the transductive node classification settings

Ablation study of different primary GNN encoders (RQ6) We adopted GCN as the default primary GNN encoder in the model presentation (Sect. 4) and the previous experiments. Here, we present more experimental results by endowing HC-GNN with advanced GNN encoders in Table 7. The table demonstrates that advanced GNN encoders can still benefit from the multi-grained semantics of HC-GNN. For instance, GCNII can stack many layers to capture long-range information; however, it still follows a flat message-passing mechanism and hence naturally ignores the multi-grained semantics. HC-GNN further ameliorates this problem for better performance.

6 Conclusion and future work

This paper has presented a novel Hierarchical Message-passing Graph Neural Networks (HMGNNs) framework, which deals with two critical deficiencies of the flat message-passing mechanism in existing GNN models, i.e., the limited ability for long-range information aggregation and the inability to encode meso- and macro-level graph semantics. Following this innovative idea, we further presented the first implementation, Hierarchical Community-aware Graph Neural Network (HC-GNN), with the assistance of a hierarchical community detection algorithm. The theoretical analysis confirms HC-GNN’s significant ability to capture long-range interactions without introducing heavy computational complexity. Extensive experiments conducted on 9 datasets show that HC-GNN consistently outperforms state-of-the-art GNN models in 3 tasks, including node classification, link prediction, and community detection, under transductive, inductive, and few-shot learning settings. Furthermore, the proposed hierarchical message-passing GNN framework provides model flexibility: it readily allows different choices and customised designs of the hierarchical structure, and it incorporates well with advanced flat GNN encoders to obtain more impressive results. That said, HMGNNs can easily serve as a general practical framework to boost downstream tasks with an arbitrary hierarchical structure and encoder.

The proposed hierarchical message-passing GNNs provide a good starting point for exploiting graph hierarchy with GNN models. In the future, we aim to incorporate the learning of the hierarchical structure into the model optimisation of GNNs such that a better hierarchy can be searched on the fly. Moreover, it is also interesting to extend our framework for heterogeneous networks.