1 Introduction

A graph, consisting of nodes and edges (or links), is a data structure for modeling complex real-world systems. Graphs are widely used to represent relations between physical or conceptual entities, such as communication networks and social networks. Graph-structured biological data, such as gene regulatory networks (GRNs), protein–protein interaction (PPI) networks, and brain connectivity networks, are growing rapidly in the biomedical and bioinformatics domains, ranging from molecular structures to medical imaging systems. Graph Neural Networks (GNNs) have been introduced for geometric deep learning to abstract meaningful representations from graph-structured data (Bronstein et al. 2017; Wu et al. 2021b). The Graph Convolutional Neural Network (GCN) is one of the early representative GNNs; it propagates node information along edges and aggregates it through a message passing mechanism for graph node classification (Ye et al. 2022). Theoretically, neighbor aggregation in GNNs can be viewed as an aggregation function over nodes in the graph, and a GNN with sufficient representational capacity should be able to differentiate distinct topologies (Xu et al. 2019). While individual nodes can easily aggregate the features of their neighbors, this localized summation does not capture the broader structural nuances of the graph: sum aggregation alone cannot distinguish graphs with the same node features but differing topologies. Consequently, there is a need for advanced methods capable of learning graph representations that encompass both node attributes and the graph’s inherent structural complexity.

Graph pooling is an essential component of GNNs for graph-level representations. The goal of graph pooling is to learn a graph representation that captures the topology, node features, and other relational characteristics of the graph, which can serve as input to downstream machine learning (ML) tasks. Typically, there are two types of graph pooling: (1) global pooling, or readout, which condenses the input graph into a single vector, and (2) hierarchical pooling, which condenses the input graph into a smaller graph. The two types work in different modules of GNNs: hierarchical pooling is employed in the feature extraction module, while global pooling connects the feature extraction module with downstream tasks. Despite substantial differences in output and purpose, both types of pooling can be described under a unified framework. Global pooling can be regarded as a special form of hierarchical pooling: it maps an arbitrary-sized graph to a graph with only one node, and the embedding of that node serves as the representation of the original graph.

Graph pooling is, to a certain extent, inspired by pooling operators in Convolutional Neural Network based (CNN-based) tasks in computer vision. In CNNs, a downsampling or typical pooling layer can be defined as \(pool\left(pixel\right)=P\left(\left\{CNN\left(pixel^{\prime}\right):pixel^{\prime}\in \mathcal{N}\left(pixel\right)\right\}\right)\), where \(\mathcal{N}\left(pixel\right)\) is \(pixel\)’s neighborhood and \(P(\cdot )\) is a permutation-invariant function like an \({L}_{p}\)-norm (Bronstein et al. 2017). Pooling layers expand the CNN’s receptive field, enhance representation, and reduce sensitivity to input changes by transforming local data into abstracted high-level features (Akhtar and Ragavendran 2020). In early attempts to generalize CNN architectures to the graph domain, namely spectral CNNs or spectral-based GCNs, the geometric analogy of pooling is graph coarsening, in which only a fraction of the graph nodes are retained (Bronstein et al. 2017). For the cross-domain spatial-based approaches, it is feasible to aggregate all local information into a single vector, but all spatial information is lost after such pooling. These can be viewed as early prototypes of hierarchical graph pooling and global graph pooling, respectively.

Practical tasks like graph classification and graph clustering drive the transition from global to hierarchical pooling. Graph pooling is essential for reducing dimensionality, adapting to variable graph structures, hierarchically extracting crucial substructures, and embedding knowledge into representations (Bacciu et al. 2020; Cheung et al. 2020). By creating smaller graphs, graph pooling reduces the number of parameters, curbing overfitting, oversmoothing, and computational load (Wang et al. 2020c; Ye et al. 2022). Unlike coarsening techniques reliant on Laplacian matrices, graph pooling operators accommodate graphs with varying node counts. Graph pooling operators facilitate graph alignment by collapsing graphs of different sizes into coarsened versions with a uniform number of supernodes, allowing graph signals to be mapped to a standardized hypergraph structure. Hierarchical learning on key substructures explicitly extracts structural information, incorporates it into the graph representation, and reveals the model’s prediction preference for certain parts of the graph. This preference helps researchers understand how the model makes decisions, and thereby understand potential patterns present in the graphs (Adnan et al. 2020; Li et al. 2021b; Tang et al. 2022; Zhang et al. 2023b).

Numerous reviews have summarized graph neural networks and representation learning (Zhou et al. 2020a, 2022; Makarov et al. 2021; Zhang et al. 2022), yet few have delved into graph pooling, and those that have cover a limited set of methods (Liu et al. 2022a; Grattarola et al. 2022). Existing reviews on graph pooling offer taxonomies and mathematical overviews but tend to catalog methods, often missing recent advancements. They concentrate on a select few classical pooling techniques, giving an incomplete picture of global pooling and of the variety in operator designs. Moreover, GNNs, along with graph pooling operators, have been widely and successfully applied to various real-world tasks, such as transportation systems (Rahmani et al. 2023), power systems (Liao et al. 2022), electronic design automation (Sánchez et al. 2023), and materials science (Gong and Yan 2021; Reiser et al. 2022), receiving significant attention and thorough reviews. However, applications to biological networks from omics data, a key topic in graph modeling, have not received adequate attention and organization.

High-throughput technologies have rapidly accumulated vast amounts of patient data, yet the biomedical knowledge extracted from them remains limited. Computational bioinformatics is crucial for managing and utilizing these data, especially in precision medicine and cancer research, where integrating multi-omics data offers unprecedented opportunities for understanding complex diseases. Omics is now a broad field in the biological sciences that characterizes and quantifies biological molecules to understand an organism’s structure, function, and dynamics (Kaur et al. 2021). Omics began with genomics, focusing on the whole genome rather than single genes or variants, and has since expanded to include various disciplines, each targeting different biomolecules or processes, such as the proteome, transcriptome, and metabolome. Data types have also evolved from traditional structured formats to unstructured, semi-structured, and heterogeneous architectures with diverse characteristics (Li et al. 2022a). Although Zhang et al. and Li et al. have summarized the success of ML and deep neural networks on omics data, many graph-based methods remain unreviewed and unsystematized (Zhang et al. 2019b; Li et al. 2022a).

To close these gaps, this paper thoroughly reviews current global and hierarchical pooling operators and summarizes the applications of graph pooling operators to omics data as a notable example of their broad applicability to real-world domains. Specifically, the main contributions of our paper are as follows: (1) we propose a taxonomy for global pooling, extend the classification of hierarchical pooling, and provide the first reviews of hybrid pooling, edge-based pooling, and the inverse operation of graph pooling; (2) we discuss the evaluation framework for graph pooling operators and several open issues related to their design and application; (3) we summarize representative bioinformatics applications on omics data, demonstrating how graph pooling enhances predictive performance, provides model interpretability, and drives research advancements in specific practical domains.

This survey includes conference and journal publications on graph pooling operators and related omics applications indexed by the Web of Science (WoS) and published between April 2014 and March 2024. We also included several preprints (arXiv papers) that had not been peer-reviewed or formally published as of March 2024. This review is organized as follows: Sect. 2 briefly describes definitions of GNNs (Sect. 2.1) and related surveys on graph pooling and omics applications (Sect. 2.2), with the aim of explaining key concepts for the uninitiated reader. Section 3 details the taxonomy and computational flows of graph pooling (Sects. 3.1 and 3.2), the inverse operation of pooling (Sect. 3.3), the evaluation framework (Sect. 3.4.1), and open problems of graph pooling operators (Sects. 3.4.2–3.4.4). It seeks to update researchers on the latest developments in graph pooling and provide a roadmap for developing and assessing new operators. Section 4 analyzes representative applications in omics, including genomics (Sect. 4.1), radiomics (Sect. 4.2), and proteomics (Sect. 4.3), highlighting the necessary adaptations, advantages, and persistent challenges of graph pooling in practical contexts. This section also aims to help bioinformatics researchers choose appropriate pooling methods for similar scenarios. Section 5 concludes the survey and outlines prospective research directions in graph pooling.

2 Preliminaries

2.1 Definitions

Graph. Given a graph \(G=(V, E, {\varvec{X}}, {\varvec{A}})\) where \(V\) is the set of nodes with \(N=|V|\), \(E\) is the set of edges, \({\varvec{A}}\in {\mathbb{R}}^{N\times N}\) denotes the adjacency matrix and \({\varvec{X}}\in {\mathbb{R}}^{N\times d}\) denotes the node feature matrix in which each node has \(d\) features.

Graph convolution network. Given a GNN architecture with \(L\) layers of graph convolutions, the \(l\)-th layer computes the node representation \({{\varvec{h}}}_{v}^{l}\in {{\varvec{H}}}^{l}\) for a node \(v\in V\) by the neighborhood aggregation function (i.e., message passing (Gilmer et al. 2017)) and \({{\varvec{H}}}^{0}={\varvec{X}}\):

$$\begin{array}{c}{{\varvec{h}}}_{v}^{l}=Update\left({{\varvec{h}}}_{v}^{l-1},\,Aggregation\left({{\varvec{h}}}_{{v}_{j}}^{l-1}|{v}_{j}\in \mathcal{N}\left(v\right)\right)\right),\end{array}$$
(1)

where \(Update(\cdot , \cdot )\) is a learnable function with distinct weights at each layer for generating new node representations, \(Aggregation(\cdot )\) is a general learnable permutation-invariant function for receiving messages from the neighborhood, and \(\mathcal{N}(v)\) denotes the neighborhood of node \(v\).
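As a concrete illustration of Eq. (1), the sketch below implements one message passing layer over a dense adjacency matrix, assuming mean aggregation and a linear Update; the class name and design choices are illustrative assumptions rather than a specific published layer.

```python
import torch
import torch.nn as nn

class MeanAggregationLayer(nn.Module):
    """One message-passing layer: Update(h_v, Aggregation({h_u : u in N(v)})).

    A minimal sketch of Eq. (1) with mean aggregation and a linear Update.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # Update(.,.) concatenates the node state with the aggregated
        # neighborhood message and applies a learnable linear map.
        self.update = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, d) node representations; adj: (N, N) adjacency matrix.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees
        msg = (adj @ h) / deg                            # mean over N(v)
        return torch.relu(self.update(torch.cat([h, msg], dim=1)))
```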

Graph representation learning. The task of graph representation learning is to learn the latent features of all nodes \({\varvec{H}}=\{{{\varvec{h}}}_{1}, ... , {{\varvec{h}}}_{N}\}\), and get the representation \({{\varvec{H}}}_{G}\) for the entire graph \(G\).

Graph classification. Given a set of labeled graphs \((\mathcal{G}, \mathcal{Y})=\{({G}_{1}({V}_{1}, {E}_{1}, {{\varvec{X}}}_{1}, {{\varvec{A}}}_{1}),{ y}_{1}), ({G}_{2}({V}_{2}, {E}_{2}, {{\varvec{X}}}_{2}, {{\varvec{A}}}_{2}), {y}_{2}), ...\}\) where \({y}_{i}\) is the label of \({G}_{i}\), the task of graph classification is to learn a mapping function \(\mathcal{F}:\mathcal{G}\to \mathcal{Y}\) that maps the set of graphs \(\mathcal{G}\) to the set of labels \(\mathcal{Y}\) and predicts discrete labels for unknown graphs.

Graph regression. The task of graph regression consists of approximating a function \({\mathcal{F}}_{R}:\mathcal{G}\to \mathcal{Y}\), where \(\mathcal{Y}\subseteq {\mathbb{R}}\) is the set of ground-truth values, to predict the continuous properties of graphs.

Graph signal classification. A graph signal \({\varvec{X}}\in {\mathbb{R}}^{N\times d}\) is defined as the matrix containing the features of the nodes in the graph. Given a set of labeled node feature matrices \((\mathcal{X}, \mathcal{Y})=\{({{\varvec{X}}}_{1},{ y}_{1}), ({{\varvec{X}}}_{2}, {y}_{2}), ...\}\) where the node feature matrices constitute signals supported on the same graph \({G}_{sup}\), the task of graph signal classification is to learn a mapping function \({\mathcal{F}}_{sup}:({G}_{sup}, \mathcal{X})\to \mathcal{Y}\) that maps signals on \({G}_{sup}\) to the labels and predicts the labels for unknown signals.

Hierarchical pooling. A hierarchical pooling operator can be defined as a function \({\mathcal{F}}_{P}\) that maps a graph \(G=(V, E, {\varvec{X}}, {\varvec{A}})\) to a coarsened graph \({G}_{p}=({V}_{p}, {E}_{p}, {{\varvec{X}}}_{{\varvec{p}}}, {{\varvec{A}}}_{{\varvec{p}}})\), where generally \(|{V}_{p}|<|V|\) and \({{\varvec{X}}}_{{\varvec{p}}}\) and \({{\varvec{A}}}_{{\varvec{p}}}\) are transformed from \({\varvec{X}}\) and \({\varvec{A}}\) using matrix multiplication or indexing operation, as in:

$$\begin{array}{c}{G}_{p}({V}_{p}, {E}_{p}, {{\varvec{X}}}_{{\varvec{p}}}, {{\varvec{A}}}_{{\varvec{p}}})={\mathcal{F}}_{P}(G(V, E, {\varvec{X}}, {\varvec{A}})), |{V}_{p}|<|V|.\end{array}$$
(2)

Global pooling (Readout). A global pooling operator, also called a readout function, computes a graph representation vector \({{\varvec{h}}}_{G}^{l}\in {\mathbb{R}}^{{d}^{l}}\) for a graph \(G\) from its node representations \({{\varvec{H}}}^{l}\in {\mathbb{R}}^{{N}^{l}\times {d}^{l}}\) at the \(l\)-th layer, as in:

$$\begin{array}{c}{{\varvec{h}}}_{G}^{l}=Readout({{\varvec{H}}}^{l}), {{\varvec{H}}}^{l}=\{{{\varvec{h}}}_{v}^{l}|v\in V\}, {{\varvec{h}}}_{G}^{l},{{\varvec{h}}}_{v}^{l}\in {\mathbb{R}}^{{d}^{l}}.\end{array}$$
(3)

2.2 Related works

Even though several reviews have summarized graph neural networks and graph representation learning algorithms (Zhou et al. 2020a, 2022; Makarov et al. 2021; Zhang et al. 2022), only a few works have focused exclusively on graph pooling (Liu et al. 2022a; Grattarola et al. 2022). These reviews include taxonomies and mathematical descriptions of existing methods, as well as summaries of implementations and operational frameworks. Grattarola et al. elucidate graph pooling as a combination of three main operations: selection, reduction, and connection, so that all graph pooling operators can be unified under a common framework (Grattarola et al. 2022). They propose a taxonomy of pooling operators based on four properties: trainability, density of the supernodes, adaptability, and hierarchy. Similarly, Liu et al. divide pooling operators into two categories, namely flat pooling and hierarchical pooling, and propose universal, modularized frameworks for describing the process of node clustering pooling and node drop pooling (Liu et al. 2022a). Owing to the limited body of literature reviewed, these earlier works mostly focused on hierarchical pooling and emphasized the commonalities among pooling operators, resulting in a less comprehensive overview of global pooling and of the diversity of pooling operator designs (Liu et al. 2022a; Grattarola et al. 2022). Yang et al. identified graph pooling as one of the four representative algorithms for graph-level learning and further explored the subcategories within global and hierarchical pooling (Yang et al. 2023a). In their taxonomy, global pooling encompasses numeric operations, attention-based, CNN-based, and global top-K methods, while hierarchical pooling is divided into three branches: clustering-based, hierarchical top-K, and tree-based.

Moreover, the application of graph neural networks in bioinformatics has gained significant attention. High-throughput omics data analysis for reconstructing biological networks is challenging, yet it enables the creation of varied networks including PPI, GRN, and networks related to metabolism, brain, and diseases (Sulaimany et al. 2018). Muzio et al. discuss the current domains in bioinformatics where GNNs are extensively applied, including proteomics, drug development and discovery, disease diagnosis, metabolic and GRNs (Muzio et al. 2021). The rise of single-cell sequencing has accelerated the generation of omics datasets, enhancing insights into cellular diversity and function. Consequently, it has fostered numerous cell and gene-centric graphs, highlighting GNNs as a key tool for single-cell analysis (Hetzel et al. 2021; Lazaros et al. 2024). Liu et al. reviewed and compared the performance of various GNN approaches for spatial clustering tasks on spatial transcriptomics (Liu et al. 2024). Zhang et al. surveyed deep learning’s role in genomics, transcriptomics, and proteomics, offering a streamlined guideline for resolving omics problems using deep learning (Zhang et al. 2019b). Li et al. explored the integration of artificial intelligence with an extensive spectrum of omics fields, including genomics, transcriptomics, proteomics, metabolomics, radiomics, as well as single-cell omics (Li et al. 2022a).

3 Graph pooling

Graph pooling operators, essential for downsampling in GNNs, transform a graph signal defined on the input graph to a matching graph signal defined on the coarsened graph, typically with fewer nodes. These methods fall into two categories: global pooling and hierarchical pooling (Zhou et al. 2020a, 2022; Liu et al. 2022a; Ye et al. 2022; Zhang et al. 2022). Global pooling condenses the graph into a representation vector, leveraging node attributes to enhance representational capacity, though often at the expense of structural information, particularly local structural detail (Bronstein et al. 2017; Xu et al. 2019; Murphy et al. 2019; Chen et al. 2021). It can be viewed as a special case of hierarchical pooling in which the entire graph is collapsed or aggregated into one node. Hierarchical pooling, on the other hand, maintains significant substructures and adjacencies, preserving important graph features while reducing complexity. Communities of nodes, as well as representative nodes or edges, can be identified as significant substructures and selected to build coarsened graphs (Li et al. 2020a; Tang et al. 2021; Yu et al. 2022). This section provides a comprehensive review of these operators, detailing their classifications, computational flow, and integration within GNN architectures. We discuss global pooling in Sect. 3.1, delve into the more intricate hierarchical pooling in Sect. 3.2, and explore unpooling, the inverse operation of pooling, in Sect. 3.3. In Sect. 3.4, we focus on several pioneering efforts in evaluating pooling operators, including benchmark datasets, recently available libraries, experimental and theoretical comparisons, and suggestions for evaluating operators. We also discuss several key considerations in their design and implementation, including computational complexity, network connectivity, adaptivity, additional loss functions, and the incorporation of attention mechanisms.

In this review, we propose general computational flows for the most prevalent pooling methods, as shown in Fig. 1, as well as a comprehensive categorization based on the core ideas. The frequently used methods for summarizing node features in global pooling can be categorized into the following groups, in order of increasing complexity: simple permutation-invariant functions, grouping and cascading, weighted summation based on attention, and learnable readout functions. Hierarchical pooling can be categorized into clustering pooling, selection pooling, edge pooling, and hybrid pooling based on the strategies used to retrieve relevant local structures. The term "hybrid pooling" refers to a group of pooling operators that use multiple techniques and consider the properties of various strategies.

Fig. 1

Computational flow and categorization of graph pooling. Left: the taxonomy of graph pooling. Right: computational steps of common pooling methods

Figure 2 depicts the core mechanisms of pooling. Global pooling (shown in Fig. 2a) aims to convert the graph into representation vectors. Clustering pooling treats significant local structures as connected subgraphs (i.e., node communities or node clusters), whereas selection pooling treats them as representative key nodes, and edge pooling focuses on edges rather than nodes. Clustering pooling (Fig. 2b) groups the nodes and aggregates the nodes in the same cluster into a supernode. Selection pooling (Fig. 2c) scores each node, retaining the top-ranked nodes and discarding the others. There are two kinds of edge pooling strategies: edge contraction (Fig. 2d) and edge deletion (Fig. 2e). The former chooses an edge and merges its connected vertices, whereas the latter chooses an edge and retains only its connected vertices, discarding other edges and nodes. Unpooling is the inverse operation of pooling, used for upsampling on nodes, mainly to restore the coarsened graph to its earlier, finer version. Graph unpooling (Fig. 2f) restores the original graph structure while recovering the representations of dropped nodes from the current node representations.

Fig. 2

Core mechanisms of different pooling operators. a Global pooling. b Clustering pooling. c Selection pooling. The colors in the nodes indicate the rating scores. d Edge contraction pooling. e Edge deletion pooling. The red lines indicate the selected edges. f Graph unpooling. Blank nodes symbolize recovered nodes and arrows on the edges show node representation propagation

In general, graph pooling operators are applied to two levels of tasks: node-level tasks such as node classification and link prediction, and graph-level tasks in inductive learning. Aside from the most common graph-level task, graph classification, various GNNs can be used for graph regression, graph signal classification, graph generation, and graph reconstruction. The architectures of commonly used GNNs are summarized in Fig. 3. A GNN with the simple structure (shown in Fig. 3a) consists of several consecutive GCN layers, a pooling layer, and some fully connected (FC) layers. FC layers with different activation functions can be regarded as a multilayer perceptron (MLP) for specialized tasks such as graph classification or graph regression. To reduce the scale of the graphs and extract features layer by layer, a hierarchical graph neural network (HGNN, shown in Fig. 3b) comprises multiple graph pooling operators interspersed with GCN layers. To incorporate information from coarsened graphs of varying scales and generate a more robust representation, a variation of the HGNN shown in Fig. 3c incorporates jump connections from Jumping Knowledge Networks (JK nets) (Cangea et al. 2018; Xu et al. 2018). The readout results of the coarsened graphs at each scale can be aggregated in multiple ways, including concatenation, addition, weighted summation, and parameterized methods (Chen et al. 2022c). Owing to its capacity to retain information from jump connections like a residual network (He et al. 2016), the HGNN with jump connections, called the JK-net-style hierarchical architecture, has become the dominant form of GNN with hierarchical pooling for graph-level tasks. Another HGNN variant, shown in Fig. 3d, is known as parallel hierarchical pooling or multi-channel pooling. Unlike in the sequential design, each subsequent pooling operator in this architecture is applied to the input graph (or the graph updated after message passing), so the pooling operators run in parallel. Furthermore, each pooling operator focuses on different parts of the same graph structure, which yields multi-channel pooling. In this architecture, pooling operators from different channels produce hypergraphs with distinct structures by scoring the same node differently or clustering the nodes differently (Roy et al. 2021; Xu et al. 2022). Additionally, the U-Net structure from computer vision has been adapted to GNNs in recent studies, as shown in Fig. 3e (Ronneberger et al. 2015). Graph unpooling is used in tandem with the pooling operators to form the descending and ascending paths of the U-Net. Such graph U-Nets are adept at handling both graph classification and node classification (Gao and Ji 2019, 2022).

Fig. 3

General GNN architectures with pooling operators. a Simple structure. b Hierarchical architecture. c JK-net-style hierarchical architecture. d Hierarchical architecture with parallel channels. e Graph U-Net. The symbols after the readout functions represent the aggregation mechanism for graph representations, which is commonly done by concatenation or summation. Unpooling operators in the graph U-Net use skip connections to fuse the position information of nodes from the pooling operators at the same level
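As an illustration of the JK-net-style hierarchical architecture in Fig. 3c, the sketch below interleaves GCN layers with top-k selection pooling and sums a mean/max readout taken at every scale. It uses standard PyTorch Geometric components (GCNConv, TopKPooling, global_mean_pool, global_max_pool), but the overall model is a generic sketch under these assumptions, not any specific published network.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import (GCNConv, TopKPooling,
                                global_mean_pool, global_max_pool)

class JKHierarchicalGNN(nn.Module):
    """Sketch of Fig. 3c: GCN blocks interleaved with pooling, with a
    readout after every block; per-level readouts are summed (jump
    connections), then fed to an FC classifier."""
    def __init__(self, in_dim, hid_dim, num_classes, levels=3):
        super().__init__()
        self.convs = nn.ModuleList()
        self.pools = nn.ModuleList()
        for i in range(levels):
            self.convs.append(GCNConv(in_dim if i == 0 else hid_dim, hid_dim))
            self.pools.append(TopKPooling(hid_dim, ratio=0.5))
        self.classifier = nn.Linear(2 * hid_dim, num_classes)

    def forward(self, x, edge_index, batch):
        readout_sum = 0
        for conv, pool in zip(self.convs, self.pools):
            x = torch.relu(conv(x, edge_index))
            x, edge_index, _, batch, _, _ = pool(x, edge_index, batch=batch)
            # Readout at each scale: concat of mean and max (a common default).
            readout_sum = readout_sum + torch.cat(
                [global_mean_pool(x, batch), global_max_pool(x, batch)], dim=1)
        return self.classifier(readout_sum)
```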

3.1 Global pooling

The global graph pooling operators are employed as readout functions to transform the graph into a single low-dimensional dense vector. In practical usage, global pooling generally has wider applications than hierarchical pooling. Based on complexity and representation capability, representative global pooling operators can be grouped into four categories: simple functions, grouping/cascading, attention, and learnable functions, as in Tables S1–S4 (Supplementary File 1). Simple functions refer to simple permutation-invariant functions (Sect. 3.1.1), and grouping/cascading (Sect. 3.1.2) refers to a class of methods that build the graph representation by grouping or cascading nodes. Attention refers to the weighted summation of node representations, with the weights typically given by attention coefficients (Sect. 3.1.3). Learnable functions (Sect. 3.1.4) are parametric approaches, particularly those utilizing neural networks.

3.1.1 Simple permutation-invariant functions

Simple permutation-invariant functions act on the features of the nodes in the graph while ignoring the connection relationships, namely the Sum, Mean, and Max functions. Sum and Mean functions were the first global pooling techniques implemented (Duvenaud et al. 2015; Atwood and Towsley 2016). In early convolutional neural networks on graphs, the graph representation is created by summing the representations of all nodes in the graph with the Sum function, as shown in Eq. (4) (Duvenaud et al. 2015). The Mean function is essentially equal to summation and differs only in a multiplicative factor, as shown in Eq. (5) (Sun et al. 2021; Pham et al. 2021; Bianchi et al. 2022b). Max pooling refers to the more sophisticated element-wise Max function (Simonovsky and Komodakis 2017; Gao et al. 2021a). In terms of theoretical representational power over multisets, Sum has the strongest representational power, while Mean is more expressive than Max (Xu et al. 2019). Thus, many GNNs adopt Sum pooling as the readout function in practice (Duvenaud et al. 2015; Li et al. 2018; Morris et al. 2019; Yang et al. 2021b; Bacciu et al. 2021). Theoretically, calculating the mean value of node representations utilizes only first-order statistics of the node representations. The second-order statistics of the node representation matrix can also be utilized, although certain adjustments are necessary to address problems such as high dimensionality (Wang and Ji 2023). Simple permutation-invariant readout functions have the advantages of being easy to implement, easy to understand, and computationally efficient. They guarantee permutation invariance by definition, which yields the same representation for isomorphic graphs and is robust to graph perturbations. On the other hand, these simple readout functions treat each node equally and fail to discern significant structures, so non-isomorphic graphs with identical node feature multisets may map to the same representation.

$$\begin{array}{c}{{\varvec{h}}}_{G}^{l}=\sum\limits_{v\in V}{{\varvec{h}}}_{v}^{l}\end{array}$$
(4)
$$\begin{array}{c}{{\varvec{h}}}_{G}^{l}=\frac{1}{N}\sum\limits_{v\in V}{{\varvec{h}}}_{v}^{l}\end{array}$$
(5)
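To make the contrast concrete, the following minimal sketch implements the three simple readouts of Eqs. (4) and (5) plus the element-wise Max; the function name is illustrative. Any permutation of the rows of H leaves each output unchanged.

```python
import torch

def simple_readout(H: torch.Tensor, mode: str = "sum") -> torch.Tensor:
    """Simple permutation-invariant readouts over node representations H (N, d)."""
    if mode == "sum":                 # Eq. (4): strongest on multisets
        return H.sum(dim=0)
    if mode == "mean":                # Eq. (5): Sum up to a 1/N factor
        return H.mean(dim=0)
    if mode == "max":                 # element-wise maximum per feature
        return H.max(dim=0).values
    raise ValueError(f"unknown mode: {mode}")
```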

3.1.2 Grouping and cascading

To obtain a more expressive representation, grouping and cascading methods improve graph representations by concatenating different representations. Methods in this category often start with the most expressive Sum function and cascade it with other functions (Bacciu et al. 2021; Gao et al. 2022a), but the range of values drifts due to the Sum function: representations produced by the Sum function frequently differ by orders of magnitude from the other representations. The concatenation of Mean and Max has been the most popular cascading approach and is often selected as the default readout option of GNNs (Cangea et al. 2018; Luzhnica et al. 2019; Lee et al. 2019; Zhang et al. 2019a, 2021b; Qin et al. 2020; Yu et al. 2021, 2022; Bi et al. 2021). Figure 3c depicts another cascading form: connecting the readout results of different layers. DropGNN runs a GNN independently multiple times, aggregates the node representations using a Mean operation before invoking the graph readout function, and aggregates the graph representations of each run using an auxiliary readout function trained with an auxiliary loss (Papp et al. 2021). In general, layer-level cascading can be applied to any GNN with readout modules, with the resulting representations concatenated or added together. Equations (6) and (7) show the formulas for function-level and layer-level cascading, where \(CONCAT(\cdot )\) and \(||\) denote concatenation operations.

$$\begin{array}{c}{{\varvec{h}}}_{G}^{l}=\left(\frac{1}{N}\sum_{v\in V}{{\varvec{h}}}_{v}^{l}\right)\,||\,Max({{\varvec{H}}}^{l})\end{array}$$
(6)
$$\begin{array}{c}{{\varvec{H}}}_{G}=CONCAT\left({{\varvec{h}}}_{G}^{l}|l=1,\dots ,L\right)\end{array}$$
(7)
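The sketch below instantiates both cascading levels: function-level cascading as in Eq. (6) and layer-level cascading as in Eq. (7); function names are illustrative.

```python
import torch

def mean_max_readout(H: torch.Tensor) -> torch.Tensor:
    """Function-level cascading (Eq. 6): concatenate Mean and Max readouts."""
    return torch.cat([H.mean(dim=0), H.max(dim=0).values], dim=0)

def layer_cascade_readout(H_per_layer: list) -> torch.Tensor:
    """Layer-level cascading (Eq. 7): concatenate the per-layer readouts
    h_G^1, ..., h_G^L into one graph representation."""
    return torch.cat([mean_max_readout(H) for H in H_per_layer], dim=0)
```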

Although cascading focuses on the most salient features of certain nodes, it cannot explicitly distinguish nodes with diverse statuses. Global pooling operators based on the grouping strategy follow the divide-and-conquer principle: they divide nodes into different groups before aggregating and cascading the group representations. DEMO-Net groups nodes by degree, so that nodes with the same degree are pooled together and the readout scheme learns a graph representation within a degree-specific Hilbert kernel space (Wu et al. 2019). Another strategy is to learn each node’s position in the graph structure, as well as its assigned community (Roy et al. 2021; Li and Wu 2021; Lee et al. 2021). Roy et al. develop a structure-aware pooling readout that generates pooled representations for individual communities and identifies different substructures by utilizing topological indicators such as degree, clustering coefficient, and betweenness centrality, among other metrics (Roy et al. 2021). To keep structural consistency for any input graph, SSRead (Structural Semantic Readout) predefines a specific number of structural positions and then maps node representations to position representations, i.e., each hidden vector is aligned with the semantically closest structural position (Lee et al. 2021). In other words, nodes are grouped according to their structural positions. Aggregating node representations within the same group can be considered a subproblem, to which other approaches, such as simple functions, attention, and other global pooling methods, can be applied (Su et al. 2021; Duan et al. 2022). This is comparable to hierarchical pooling based on clustering or node selection, with the distinction that graph representation vectors are created directly without intermediary graphs.
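A simplified sketch of the grouping idea follows, loosely inspired by degree-based grouping as in DEMO-Net; the binning scheme and per-group Sum aggregation are illustrative assumptions, not DEMO-Net's actual kernel-space formulation.

```python
import torch

def degree_grouped_readout(H, adj, degree_bins=(1, 2, 3)):
    """Divide-and-conquer readout sketch: group nodes by (capped) degree,
    aggregate each group separately, then cascade the group representations."""
    deg = adj.sum(dim=1).long().clamp(min=1, max=max(degree_bins))
    parts = []
    for b in degree_bins:
        mask = deg == b
        # Empty groups contribute a zero vector so the output size is fixed.
        parts.append(H[mask].sum(dim=0) if mask.any()
                     else torch.zeros(H.size(1)))
    return torch.cat(parts, dim=0)
```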

3.1.3 Weighted summation based on attention

To address the issues that each node contributes equally to the output representation and that multiple graphs may map to the same representation, one popular solution is to replace the simple summation with a weighted sum of all node representations in the graph (Chen et al. 2019b; Aggarwal and Murty 2021). Furthermore, with the prevalence of attention mechanisms in deep neural networks, attention scores have been found well suited for variable weighting. In general, a readout function based on attention-weighted summation can be defined as Eq. (8), in which \(\tau (\cdot )\) denotes a linear or nonlinear transformation (Gilmer et al. 2017; Fan et al. 2020; Itoh et al. 2022). In practice, either matrix multiplication (Chen et al. 2019b; Wang and Ji 2023) or a variety of neural networks, such as long short-term memory networks (LSTMs) (Vinyals et al. 2016), GCNs (Meltzer et al. 2019), and MLPs (Gilmer et al. 2017; Chen et al. 2019b, a; Li et al. 2019; Fan et al. 2020; Baek et al. 2021; Itoh et al. 2022), among many others, can build differentiable attention mechanisms trained end-to-end.

$$\begin{array}{c}{{\varvec{h}}}_{G}^{l}=\sum_{v\in V}(Softmax(\tau ({{\varvec{h}}}_{v}^{l}))\cdot {{\varvec{h}}}_{v}^{l})\end{array}$$
(8)
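A minimal sketch of Eq. (8) follows, assuming a two-layer MLP for \(\tau (\cdot )\); real implementations differ in the choice of \(\tau \) and in the normalization.

```python
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    """Attention-based weighted summation (Eq. 8): an MLP scores each node,
    Softmax normalizes the scores over the graph, and the nodes are summed
    with those weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.tau = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                 nn.Linear(dim, 1))

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (N, d). alpha: (N, 1) attention coefficients summing to 1.
        alpha = torch.softmax(self.tau(H), dim=0)
        return (alpha * H).sum(dim=0)
```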

The key advantage of attention is its capability to quantitatively treat each feature differently, enabling the model to discern the essential information for classification and amplify the contribution of this relevant information through attention coefficients. To avoid identical attention coefficients, the Frobenius norm is introduced as a penalty term for the attention coefficient in the Self-attentive Graph Embedding (SAGE) method (Li et al. 2019). Inspired by the Transformer architecture, Graph Multiset Transformer (GMT) uses attention-based blocks to condense the graph into a few important nodes and consider their interaction (Vaswani et al. 2017; Baek et al. 2021). In this approach, the mapping of general nodes to important nodes is based on a multi-head attention block with key-value pairs, while a self-attentive function evaluates the interaction of significant nodes, and the important nodes are then mapped back to represent the entire graph. Another advantage of attention pooling is its capability to address the challenge of learning a fixed-size graph representation for graphs with varying dimensions, all while preserving permutation invariance (Meltzer et al. 2019; Chen et al. 2019a). In the Dual Attention Graph Convolutional Networks (DAGCN), the self-attention pooling layer learns several graph representations in different spaces and returns a fixed-size matrix graph embedding. Each row is one representation learned by weighted summation in one space, while the number of spaces is a tunable hyperparameter (Chen et al. 2019a).

3.1.4 Learnable readout functions

Attention-based weighted summation operators are a special subset of learnable readout functions (Wu et al. 2021a). All other approaches with trainable parameters, beyond learning a coefficient for each node, can be categorized as learnable readout functions. Hence, learnable graph pooling forms a group of global pooling methods with higher complexity, and the variety of implementation strategies makes a general formula difficult to state. SortPooling is the first learnable global pooling approach; it learns the graph representation by cropping the graph nodes to a fixed size and feeding them into a CNN (Zhang et al. 2018). SortPooling considers the last layer’s output to be the nodes’ most refined continuous Weisfeiler-Lehman (WL) colors, then sorts all nodes by these final colors and drops the lower-ranked nodes. Beyond cropping the node representation matrix, other approaches use graph structure alignment to create uniformly sized graph representation matrices that fit the convolutional layers (Yuan and Ji 2020; Bai et al. 2021; Xu et al. 2022). Bai et al. present a node matching framework that transitively aligns the nodes of a family of graphs by gradually minimizing the inner-node-cluster sum of squares over all graph nodes; this framework maps graphs of arbitrary sizes into fixed-sized aligned node grid structures (Bai et al. 2021). Another learnable pooling is an LSTM over a global representation, in which the network concentrates on one portion of the graph at a time and gradually puts its representation into memory (Lee et al. 2018). DKEPool learns the distribution of node features in the graph through a Gaussian manifold in the non-linear distribution space (Chen et al. 2022b). Murphy et al. propose an idealized framework, named Relational Pooling (RP), whose representational power exceeds the WL isomorphism test by exhausting all permutations and then summing and averaging their representations (Murphy et al. 2019). Viewing readouts as learning over sets, Navarin et al. present a general formulation capable of encoding or approximating any continuous permutation-invariant function over sets, facilitating the mapping from the set of node representations to a fixed-size vector (Navarin et al. 2019). However, due to the extreme complexity of this framework, direct implementation is usually impractical; hence, RP and Local Relational Pooling (LRP) propose computationally tractable or targeted approximation approaches (Murphy et al. 2019; Chen et al. 2021). Buterez et al. investigated the potential of adaptive readouts provided by various neural networks on more than 40 datasets from different areas, and their experimental results demonstrate that the permutation-invariance constraint can be relaxed in some specific tasks (Buterez et al. 2022).
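As one concrete example of a learnable readout, the sketch below follows the SortPooling idea described above: nodes are sorted by their last feature channel and cropped or padded to a fixed size k. The downstream 1-D CNN is omitted, and the function is a simplified illustration of the method, not the reference implementation.

```python
import torch

def sort_pool(H: torch.Tensor, k: int) -> torch.Tensor:
    """SortPooling sketch: sort nodes by the last feature channel (treated
    as the most refined continuous WL color), keep the top k, and zero-pad
    small graphs so every graph yields a (k, d) tensor for a 1-D CNN."""
    order = torch.argsort(H[:, -1], descending=True)
    H = H[order][:k]
    if H.size(0) < k:  # pad graphs with fewer than k nodes
        H = torch.cat([H, torch.zeros(k - H.size(0), H.size(1))], dim=0)
    return H
```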

3.2 Hierarchical pooling

Hierarchical graph pooling operators transform a graph into a coarsened graph with fewer nodes and edges. Hierarchical pooling is inspired by regular pooling in CNNs on grid structures; it serves a role similar to downsampling and can be viewed as a neural-network implementation of graph reduction, graph coarsening, and graph sparsification. Commonly, hierarchical pooling is used in conjunction with global pooling operators to read out the coarsened graph. We categorize hierarchical pooling operators into four groups based on the strategy used to construct the condensed graph: clustering pooling, selection pooling, edge pooling, and hybrid pooling. Their representative methods are listed in Tables S5–S8 (Supplementary File 1), and each of these types of pooling is discussed in detail in Sects. 3.2.1 to 3.2.4.

3.2.1 Clustering pooling

Based on the assumption that each node is a member of a potentially significant substructure, clustering pooling maps nodes of the input graph to nodes of the coarsened graph. The nodes of the coarsened graph can also be referred to as supernodes, and the coarsened graph as a hypergraph, since each supernode represents a substructure of the original graph; clustering pooling operators can thus benefit from existing community detection and graph clustering algorithms. A real-life analogy to this assumption can be found in molecular structures, where functional groups are considered communities of atoms, and molecules are assemblies of several functional groups. Besides discovering the alignment between nodes and supernodes, clustering pooling requires learning the representations of supernodes and determining the links between supernodes. The latter two can be summarized as learning the hypergraph, whose central task is node clustering. Formally, we describe a generic computational flow for clustering pooling in the following steps.

Step 1, Node Clustering: Given a potential hypergraph \({G}_{hyper}\) with node set \({V}_{hyper}\), we assume \({|V}_{hyper}| < |V|\) and define a surjection \({f}_{clus}: V\to {V}_{hyper}\), also known as the vertex mapping function or clustering function, so that every node in \({V}_{hyper}\) has at least one corresponding node in \(V\). Variation in the clustering function \({f}_{clus}\) results in significant differences among clustering pooling operators. Furthermore, permutation invariance of the clustering function is required for the permutation invariance of clustering pooling. An explicit or implicit cluster assignment matrix \({\varvec{S}}\in {\mathbb{R}}^{|V|\times |{V}_{hyper}|}\), with each row representing a node and each column representing a supernode, can be used to describe the outcome of node clustering. The pooling ratio \(r\) is defined as the ratio of the number of clusters to the number of nodes, i.e., \(r={|V}_{hyper}|/|V|\). The pooling ratio is a hyperparameter in both deterministic algorithms and parametric networks, since it decides the size of the hypergraph, which affects the computational complexity of the algorithm and the quantity of retained information.

Step 2, Learning Hypergraph: Step 1 specifies which supernodes are included in the hypergraph, but it establishes neither the features of the supernodes nor how they are linked to one another. Hence, the hypergraph created in the previous step is completed with the two items below. Step 2.1, Learning the representations of supernodes: This step can be considered a local readout operation and can use any readout function described in Sect. 3.1. One classical method is to use the cluster assignment matrix to conduct a weighted summation, as in Eq. (9), where \({{\varvec{Z}}}^{l+1}\) is the representation matrix of the supernodes before message passing.

$$\begin{array}{c}{{\varvec{Z}}}^{l+1}={{\varvec{S}}}^{T}{{\varvec{H}}}^{l},{{\varvec{Z}}}^{l+1}\in {\mathbb{R}}^{{N}^{l+1}\times {d}^{l}}.\end{array}$$
(9)

Step 2.2, Learning the hyperedges: Intuitively, if two clusters are adjacent, at least one node in each cluster is adjacent to a node in the other cluster. Similarly, the transformation of a node’s adjacency matrix yields the adjacency matrix of a supernode. Hence, the hypergraph’s adjacency matrix \({{\varvec{A}}}^{l+1}\) can be calculated using the original graph’s adjacency matrix \({{\varvec{A}}}^{l}\) and the cluster assignment matrix \({\varvec{S}}\), as shown in Eq. (10).

$$\begin{array}{c}{{\varvec{A}}}^{l+1}={{\varvec{S}}}^{T}{{\varvec{A}}}^{l}{\varvec{S}},{{\varvec{A}}}^{l+1}\in {\mathbb{R}}^{{N}^{l+1}\times {N}^{l+1}}\end{array}$$
(10)
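Once an assignment matrix is available, the two learning steps reduce to the matrix products of Eqs. (9) and (10). The sketch below applies them to a hard assignment produced by any deterministic clustering algorithm (e.g., a Graclus-style partition); the helper name and one-hot construction are illustrative.

```python
import torch

def coarsen(H, A, cluster_id, n_clusters):
    """Build the coarsened graph from a hard cluster assignment (Eqs. 9-10).

    cluster_id: (N,) long tensor mapping each node to its cluster, i.e. the
    surjection f_clus. S is the one-hot assignment matrix."""
    S = torch.nn.functional.one_hot(cluster_id, n_clusters).float()  # (N, N_hyper)
    Z = S.T @ H             # Eq. (9): supernode features sum member features
    A_coarse = S.T @ A @ S  # Eq. (10): supernodes adjacent iff members were
    return Z, A_coarse
```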

In addition, pooling methods can also include task-specific procedures such as edge sparsification and cluster selection. Specifically, clustering pooling can be categorized into three types: graph clustering pooling, soft clustering pooling, and rule-based central node pooling.

3.2.1.1 Graph clustering pooling

Graph clustering pooling refers to pooling operators employing deterministic algorithms from graph clustering, community detection, and graph topology. In the early stages, researchers employed pre-existing graph clustering techniques without modification, including hierarchical clustering, spectral clustering, and others (Bruna et al. 2014; Henaff et al. 2015; Defferrard et al. 2016; Monti et al. 2017). Defferrard et al. introduced the Graclus algorithm to generalize the CNN framework from low-dimensional regular grids to high-dimensional irregular graphs, which has proven highly successful in clustering a large number of diverse graphs (Dhillon et al. 2007; Defferrard et al. 2016). Graclus has been widely used as a base algorithm for clustering pooling owing to its capability to calculate successively coarser versions of input graphs (Fey et al. 2018; Levie et al. 2019; Bianchi et al. 2022a). However, these pooling methods were not designed for integration with contemporary neural network models, which currently limits their adaptability. EigenPooling is the first graph clustering pooling operator integrated with current GNN frameworks (Ma et al. 2019). It employs spectral clustering to obtain a controllable number of subgraphs while considering both local and global properties. HaarPooling is a spectral graph pooling operator that relies on compressive Haar transforms to filter out fine-detail information in the Haar wavelet domain, resulting in a sparse coarsened graph (Wang et al. 2020c; Zheng et al. 2023). More methods in spectral graph coarsening, graph reduction, and graph sparsification are reviewed elsewhere (Shuman et al. 2016; Loukas 2019; Bravo-Hermsdorff and Gunderson 2019). Tsitsulin et al. introduced DMoN (Deep Modularity Networks), an unsupervised module designed to optimize cluster assignments with an objective that combines spectral modularity maximization and collapse regularization (Tsitsulin et al. 2023). WGDPool (Weighted Graph Dual Pooling) is a graph clustering pooling algorithm that provides a differentiable k-means clustering variant, utilizing a Softmin assignment based on node-to-centroid distances (Xiao et al. 2024).

In order to obtain interpretable clustering results, CommPOOL provides a community pooling mechanism that captures the inherent community structure of the graph in an interpretable way, using an unsupervised clustering method, Partitioning Around Medoids (PAM), on the node latent feature vectors (Tang et al. 2021). Other graph clustering pooling operators use only graph topology to discover node clusters; these methods are usually non-parametric and make it easy to interpret the clustering results. Luzhnica et al. calculated the nodes’ maximal cliques using a modified Bron-Kerbosch algorithm (Luzhnica et al. 2019). SEP measures the complexity of the hierarchical graph structure using structural entropy and globally optimizes the hierarchical cluster assignment by minimizing structural entropy (Wu et al. 2022). KPlexPool is based on the concepts of graph covers and k-plexes, enabling a more flexible definition of cliques and guaranteeing the complete coverage of cliques on nodes, which ensures the clustering function is a surjection (Bacciu et al. 2021). Bacciu et al. proposed a clustering pooling method based on Maximal k-Independent Sets (k-MIS) from graph theory, which is designed to detect nodes that maintain a minimum distance of k from each other in a graph (Bacciu et al. 2023). Graph Parsing Network (GPN) utilizes a bottom-up graph parsing algorithm, similar to grammar induction, inferring clusters from nodes and learning a personalized pooling structure for each graph (Song et al. 2024).

3.2.1.2 Soft clustering pooling

Soft clustering pooling refers to a differentiable pooling operation with learnable parameters for computing the cluster assignment matrix. DiffPool is the first soft clustering pooling operator; it learns a cluster assignment matrix using GraphSAGE (Ying et al. 2018). The soft cluster assignment matrix \({\varvec{S}}\) is calculated as in Eq. (11), and the adjacency matrix and node representations of the hypergraph are calculated as in Eqs. (9) and (10):

$$\begin{array}{c}{{\varvec{S}}}^{l}=Softmax\left({GNN}_{pool}^{l}\left({{\varvec{A}}}^{l},{{\varvec{Z}}}^{l}\right)\right),\end{array}$$
(11)

where the \(Softmax(\cdot )\) function is applied row-wise. It is important to note that in the soft assignment matrix, the assignment coefficient between any node and a supernode lies between 0 and 1. This contrasts with the cluster assignment derived by deterministic algorithms, which usually contains only 0 or 1. The soft assignment matrix thus provides a more expressive description of how a node can be assigned to multiple clusters with varying probabilities, or contribute to forming multiple clusters to different degrees. DiffPool is widely adopted as the benchmark for differentiable pooling, and many follow-ups have contributed improvements to it, including replacing modules with more powerful GCNs (Bandyopadhyay et al. 2020), reducing parameters by merging the GCNs for representation learning and cluster assignment (Pham et al. 2021), and multi-channel pooling mechanisms (Zhou et al. 2020b; Liang et al. 2020). Ying et al. refined dense clustering pooling by incorporating persistent homology, simplifying the coarsened graphs (Ying et al. 2024). The key procedure entails resampling the adjacency matrix using Gumbel-softmax, applying persistence injection, and guiding the training with a topological loss function.
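A minimal dense sketch of the soft-assignment pipeline follows: Eq. (11) computes S with a row-wise Softmax, and Eqs. (9) and (10) coarsen the graph. For brevity, the assignment GNN here is a single GCN-like layer A @ H @ W, whereas DiffPool itself uses GraphSAGE; DiffPool's auxiliary losses (link prediction, entropy) are omitted.

```python
import torch
import torch.nn as nn

class SoftClusterPool(nn.Module):
    """DiffPool-style soft clustering sketch (Eq. 11 with Eqs. 9-10)."""
    def __init__(self, in_dim: int, n_clusters: int):
        super().__init__()
        self.W = nn.Linear(in_dim, n_clusters, bias=False)

    def forward(self, H: torch.Tensor, A: torch.Tensor):
        # Eq. (11): row-wise Softmax over a GCN-like transform, S: (N, N_hyper).
        S = torch.softmax(self.W(A @ H), dim=1)
        Z = S.transpose(0, 1) @ H             # Eq. (9)
        A_coarse = S.transpose(0, 1) @ A @ S  # Eq. (10)
        return Z, A_coarse, S
```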

Furthermore, conditional random fields (CRFs) and Non-Negative Matrix Factorization (NMF) can be applied in soft clustering pooling to capture node-cluster connections (Bacciu and Di Sotto 2019; Yuan and Ji 2020). Meanwhile, classical graph algorithms have also been integrated into GNN systems; for example, GNN implementations of spectral clustering and Graph Mapper-based soft clustering are competitive in both theoretical and practical performance (Maria Bianchi et al. 2020; Bodnar et al. 2021). Unlike the above methods, graph classification methods based on graph capsule networks also incorporate the concept of clustering without defining explicit clusters and cluster assignment matrices (Xinyi and Chen 2018; Yang et al. 2021a). In a graph capsule network, node capsules are connected to graph capsules. Each graph capsule represents a meaningful substructure or node feature space, and the graph capsules are linked to classification capsules to produce classification results. Dynamic routing, which plays the role of the cluster assignment matrix, is used to connect these capsules.

3.2.1.3 Rule-based central node pooling

The last category of clustering pooling is rule-based central node pooling, which generates clusters formed and filtered around central nodes with predefined rules. The differences among these pooling strategies center on two fundamental questions: How are the central nodes decided? How are clusters formed and selected? One approach is to treat each node as a central node and create a cluster structure around it using specific criteria, often involving the node’s first-order or second-order neighborhood (Ranjan et al. 2020; Su et al. 2021; Yu et al. 2021, 2022; Li et al. 2022c). A popular approach is Adaptive Structure Aware Pooling (ASAP), which treats each node’s first-order neighborhood as a cluster, performs local soft cluster assignment learning, and then coarsens the graph using the clusters as nodes (Ranjan et al. 2020). Another option is to select a predefined number of nodes as central nodes using heuristic methods or topological metrics, such as degrees or locally normalized neighboring signals (Noutahi et al. 2019; Sun et al. 2021).

To generate clusters, a parametric approach maps the non-central nodes into the clusters represented by the central nodes and learns an assignment matrix (Noutahi et al. 2019; Su et al. 2021). On the other hand, non-parametric approaches often identify multi-hop neighborhoods of nodes (Ranjan et al. 2020; Yu et al. 2021, 2022; Li et al. 2022c) or subgraphs (Sun et al. 2021) as clusters depending on the nodes’ local connectivity. All approaches that do not limit the number of central nodes require cluster filtering based on cluster fitness scores. The fitness score can be calculated using Local Extrema Convolution (LEConv), a graph convolution method for obtaining local extremum information (Ranjan et al. 2020; Li et al. 2022c). Other measures, such as scalar projection (Sun et al. 2021; Yu et al. 2021) and ego-network closeness scores (Zhong et al. 2022), can be used to evaluate the role that clusters play in both structure and features. There are also a few operators that, without an explicit cluster assignment matrix, iteratively pick and collapse mergeable node pairs or node sets one by one, driven by the notion that nodes with similar characteristics should belong to the same cluster and be merged (Hu et al. 2019; Xie et al. 2020).

3.2.2 Selection pooling

Intuitively, selection pooling reduces the size of the graph by removing some of the nodes. This graph sparsification strategy avoids dense coarsened graphs, which would burden the model in computational complexity and memory. The primary emphasis of selection pooling operators is determining which nodes to retain and which to discard, typically accomplished through an evaluation mechanism that scores the nodes. The edges among the preserved nodes in the original graph are maintained to establish the topological connections between the supernodes of the hypergraph created by selection pooling. The following steps describe the generic computational procedure of selection pooling.

Step 1, Node Evaluation: Given the node representation matrix \({\varvec{H}}\) and the adjacency matrix \({\varvec{A}}\), the selection pooling operator maintains a node evaluation function \({f}_{sel}:({{\varvec{h}}}_{v},{\varvec{H}},{\varvec{A}})\to {s}_{v}\) to map each node, which is not directly comparable in the high-dimensional space, to a fitness value \({s}_{v}\in {\mathbb{R}}\). In selection pooling, the evaluation outcome over all nodes is often represented as a projection vector; similar to the cluster assignment matrix \({\varvec{S}}\) in clustering pooling, the selection function can also be stated as \({f}_{sel}:({\varvec{H}},{\varvec{A}})\to {\varvec{s}}\) with \({\varvec{s}}\in {\mathbb{R}}^{N}\). The pooling ratio is a critical hyperparameter in selection pooling, as the removal of nodes inherently causes information loss, and the ratio determines the extent to which essential node information is retained. According to differences in the implementation of the fitness or importance evaluation function, evaluation functions can generally be categorized into non-parametric and parametric pooling.

Non-parametric selection pooling: One of the early pooling methods incorporating the concept of node selection was implemented within an extended CNN architecture designed for graph data, whose authors extensively explored graph signal sampling methods to compute node sampling matrices that filter both nodes and node features (Gama et al. 2019). Parameter-free approaches often employ deterministic or heuristic metrics, such as degree and degree centrality (Zhang et al. 2021b), subgraph centrality (Ma et al. 2020), the Manhattan distance between a node representation and the one reconstructed from its neighbors (Zhang et al. 2019a), neighborhood information gain (Gao et al. 2022a), correlation coefficients (Jiang et al. 2020), and the distance between the node and the cluster center precomputed by k-means (Wang et al. 2022).

Parametric selection pooling: The first parametric approach can be traced back to graph pooling (gPool), which uses a trainable projection vector \({\varvec{p}}\) to project all node features to 1D and applies k-Max pooling to select nodes (Gao and Ji 2019). The scalar projection of a node \(v\) with feature vector \({{\varvec{h}}}_{v}\) on \({\varvec{p}}\) is \({s}_{v}={{\varvec{h}}}_{v}{\varvec{p}}/||{\varvec{p}}||\), where \(||\cdot ||\) is the L2 norm (Cangea et al. 2018; Gao and Ji 2019, 2022; Qin et al. 2020; Bi et al. 2021; Gao et al. 2021a). To select coarse nodes, OTCoarsening employs a GNN to weight all nodes (Ma and Chen 2021). Currently, many parametric approaches adapt attention mechanisms to compute importance scores (Gao and Ji 2019; Lee et al. 2019; Knyazev et al. 2019; Huang et al. 2019; Qin et al. 2020; Li et al. 2020a; Gao et al. 2020; Aggarwal and Murty 2021; Bi et al. 2021; Duan et al. 2022). Recently, several approaches have emerged that concurrently consider node feature information and structural information, addressing both local and global node features. Before the weighted summation, Graph Self-adaptive Pooling (GSAPool) calculates scores on node features and topology using an MLP and a GNN (Zhang et al. 2020), while MSAPool (Multiple Strategy-based Attention Pooling) interprets the GCN scores as reflecting a local perspective and the MLP scores as capturing a global perspective (Xu et al. 2022). Before the projection, UGPool executes a simple message passing step to gather local feature information and calculates node scores \({{\varvec{s}}}^{l}={\widehat{{\varvec{A}}}}^{l}{{\varvec{H}}}^{l}{{\varvec{p}}}^{l}/||{{\varvec{p}}}^{l}||\), where \({\widehat{{\varvec{A}}}}^{l}\) denotes the normalized adjacency matrix (Qin et al. 2020).

Non-parametric and parametric approaches are not mutually exclusive, and some pooling operators contain both parametric and non-parametric modules (Nouranizadeh et al. 2021; Stanovic Stevan and Gaüzère 2022). KnnPool first projects the nodes using an MLP and a GCN, and then selects nodes by computing the distance between each node and the cluster center (Chen et al. 2022a). Topology-Aware Pooling (TAP) considers two voting processes: local voting and global voting (Gao et al. 2021a). Local voting is based on the average similarity between a node and its neighboring nodes, whereas global voting employs projection vectors. MVPool (Multi-view Graph Pooling) offers three views: a structure-specific view based on node degree centrality, a feature-specific view based on node features and an MLP, and a combined structure- and feature-specific view based on a variation of PageRank (Zhang et al. 2021b). Gao et al. proposed a structure-aware kernel representation for evaluating node similarity from a graph topology view and implemented it as a parametric, learnable approach (Gao et al. 2021b).

Step 2, Node Selection: After obtaining the node information scores, the next step is to choose which nodes to keep. Nodes are re-ordered by their fitness scores, and a subset of the top-ranked nodes is selected. Equation (12) represents this process, where \(idx\) denotes the indices of selected nodes; the pooling ratio \(r\) and the number of graph nodes \(N\) together determine the number of selected nodes, which is at least 1:

$$\begin{array}{c}idx=top\_rank\left({{\varvec{s}}}^{l},\lceil r\times {N}^{l}\rceil\right)\end{array}$$
(12)

The Multidimensional score space with flIpscore and Dropscore operations (MID) focuses on the node selection module and addresses the neglect of node feature and graph structure diversity by incorporating flipscore and dropscore operations (Liu et al. 2023). Specifically, the flipscore operation takes the absolute value of all elements in the multi-dimensional score matrix, while the dropscore operation randomly discards a certain proportion of nodes in the graph when selecting the top k nodes.

Step 3, Learning Hypergraph: In selection pooling, learning the hypergraph involves two aspects: the representation of the hypernodes and the adjacency matrix, both usually obtained by extracting rows and columns of the original representation and adjacency matrices. This process is described by Eqs. (13 and 14), in which \({\varvec{H}}(idx,:)\) and \({{\varvec{A}}}^{l}(idx,idx)\) perform the row and/or column extraction. Concomitantly, the evaluation scores can act as gating coefficients that influence the update of each supernode's representation, as shown in Eq. (15), where \(\odot \) denotes the element-wise broadcast operation by row. This reflects the idea that more important nodes should be kept more integrally, even among the selected supernodes.

$$\begin{array}{c}{{\varvec{Z}}}^{l+1}={{\varvec{H}}}^{l}\left(idx,:\right),{{\varvec{Z}}}^{l+1}\in {\mathbb{R}}^{{N}^{l+1}\times {d}^{l}}\end{array}$$
(13)
$$\begin{array}{c}{{\varvec{A}}}^{l+1}={{\varvec{A}}}^{l}\left(idx,idx\right),{{\varvec{A}}}^{l+1}\in {\mathbb{R}}^{{N}^{l+1}\times {N}^{l+1}}\end{array}$$
(14)
$$\begin{array}{c}{{\varvec{Z}}}^{l+1}={{\varvec{s}}}^{l}(idx)\odot{{\varvec{H}}}^{l}\left(idx,:\right)\end{array}$$
(15)

As these steps suggest, selection pooling operators follow a fairly uniform computational pattern, and the data exchanged between steps is largely the same across methods. Consequently, the procedures in each step can be abstracted and optimized by neural architecture search, as in Pooling Architecture Search (PAS) (Wei et al. 2021). Chen et al. proposed the graph self-correction (GSC) mechanism to reduce information loss during graph coarsening by compensating with information generated by feedback procedures, where the compensating information is calculated using complement graph fusion and coarsened graph back-projection (Chen et al. 2022c). There is also a subset of parametric methods that use mutual information to train auxiliary encoder modules: Coarsened Graph Infomax Pooling (CGIPool) maximizes the mutual information between the input and the coarsened graph (Pang et al. 2021b), and Vertex Infomax Pooling (VIPool) determines the importance of a node by estimating the mutual information between nodes and their neighborhoods (Li et al. 2020b). All of the approaches described above produce scoring vectors that can be used to visualize the node ordering. Node Decimation Pooling (NDP) is an intriguing node selection pooling operator that decimates nodes by eliminating one of the two sides of the MAXCUT partition and connects the remaining nodes using a link construction procedure (Bianchi et al. 2022b). Without ranking the fitness scores of individual nodes, NDP evenly divides all nodes into two clusters, retaining the nodes from one cluster while decimating those from the other.

3.2.3 Edge pooling

Edge pooling is a class of pooling operations that perform graph coarsening by mining the edge features. Edge pooling can be divided into two categories based on the edge-operating strategy: edge contraction and edge deletion. The difference between these two edge pooling categories is comparable to that between clustering pooling and selection pooling.

3.2.3.1 Edge contraction

Edge contraction typically involves the following steps: identifying edges that can be contracted, performing the contraction (merging nodes), and maintaining graph connectivity. EdgePool is the first edge-contraction-based pooling operator that can be integrated into existing GNN systems (Diehl et al. 2019; Diehl 2019). The score of the edge \({e}_{ij}\) between nodes \({v}_{i}\) and \({v}_{j}\) is learned from node features, as in:

$$\begin{array}{c}{s}_{i,j}=\sigma ({\varvec{W}}({{\varvec{h}}}_{{v}_{i}}||{{\varvec{h}}}_{{v}_{j}})+{\varvec{b}}),\end{array}$$
(16)

where \({\varvec{W}}\) and \({\varvec{b}}\) are learnable parameters, and \(\sigma (\cdot )\) could be \(Tanh(\cdot )\) or \(Softmax(\cdot )\). All edges are ranked by their scores, and the highest-scoring edge whose endpoints have not yet been contracted is selected sequentially. Since edge contraction can be viewed as a readout over the two endpoint nodes, the merged node's features are the gated sum of the two nodes' features, \({{\varvec{h}}}_{{v}_{ij}}={s}_{i,j}({{\varvec{h}}}_{{v}_{i}}+{{\varvec{h}}}_{{v}_{j}})\) (Diehl et al. 2019; Yuan et al. 2020).
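As a rough illustration of this scheme, the sketch below scores edges per Eq. (16), using \(Tanh(\cdot )\) as one of the activations named above, and contracts non-adjacent top-scoring edges greedily; the greedy loop is our simplification rather than the reference implementation.

```python
import numpy as np

def edge_contraction_pool(H, edges, W, b):
    """Sketch of EdgePool-style contraction; W: (2d,) weights, b: scalar bias."""
    # Eq. (16): score each edge from the concatenated endpoint features
    scored = [((i, j), np.tanh(W @ np.concatenate([H[i], H[j]]) + b))
              for i, j in edges]
    used, merged = set(), []
    # Greedily contract the highest-scoring edge whose endpoints are still free
    for (i, j), s in sorted(scored, key=lambda t: -t[1]):
        if i not in used and j not in used:
            used.update((i, j))
            merged.append(s * (H[i] + H[j]))  # gated sum of the two endpoints
    # Nodes untouched by any contraction are carried over unchanged
    merged += [H[v] for v in range(len(H)) if v not in used]
    return np.stack(merged)
```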

3.2.3.2 Edge deletion

Equation (16) can also be used to score edges in edge deletion pooling (Galland and Lelarge 2021). The hyperedge pooling operator is a representative edge-deletion method: it propagates node projection scores through PageRank and averages the scores of the two endpoints as the edge's evaluation score (Zhang et al. 2021c). Another evaluation viewpoint builds on the observation that the greater the difference between the endpoints, the more information they can obtain from each other, and hence the more important the edge, as shown in Eq. (17) (Gao et al. 2020). After obtaining the edge evaluation scores, the process of edge deletion pooling is similar to node selection pooling, except that the operating subject switches from nodes to edges, as shown in Eqs. (18 and 19), where \(top\_rank(\cdot ,\cdot )\) returns the indices of selected edges and \(E(idx)\) removes the edges outside \(idx\) (Zhang et al. 2021c; Yu et al. 2022).

$$\begin{array}{c}{s}_{i,j}={e}^{\Vert {{\varvec{h}}}_{{v}_{i}}-{{\varvec{h}}}_{{v}_{j}}\Vert }\end{array}$$
(17)
$$\begin{array}{c}idx=top\_rank\left({{\varvec{s}}}^{l},\lceil r\times \left|{E}^{l}\right|\rceil\right)\end{array}$$
(18)
$$\begin{array}{c}{E}^{l+1}={E}^{l}\left(idx\right)\end{array}$$
(19)

Dual Hypergraph Transformation (DHT) is a unique technique for edge pooling that transforms edges into hypergraph nodes (Jo et al. 2021). By applying clustering and dropping methods on the dual hypergraph, DHT generates both coarsened graphs and global edge representations. In addition, edge contraction and edge deletion are not mutually exclusive, and they can be described under a unified framework (Bravo-Hermsdorff and Gunderson 2019).

3.2.4 Hybrid pooling

Hybrid pooling is a hierarchical pooling framework that concurrently employs two or more strategies from clustering pooling, selection pooling, and edge pooling. Many rule-based central node pooling operators can be classified as hybrid pooling methods, as they create the cluster structure by scoring and sorting central nodes or clusters to reduce the number of supernodes. Approaches characterized as sequential selection-then-clustering first select nodes as cluster centers and then cluster the remaining nodes (Noutahi et al. 2019). Interleaved clustering and selection describes methods in which local clusters are formed first, followed by the selection of clusters and the subsequent refinement of cluster assignments along with cluster readout (Ranjan et al. 2020; Su et al. 2021; Sun et al. 2021; Yu et al. 2021; Li et al. 2022c).

Another hybrid family extends selection pooling by aggregating the information of discarded nodes into supernodes, followed by an intra-cluster readout. The primary motivation of these pooling operators is to address the challenge that selection pooling usually discards a significant amount of information, including node attributes and topology, as it coarsens graphs by removing a large number of nodes. A natural option is to aggregate the node information from local neighbors into the supernode (Huang et al. 2019; Zhang et al. 2020; Qin et al. 2020; Li et al. 2020b, a; Bi et al. 2021). ProxPool decomposes the relearning of the supernode representation into three levels, resembling the construction of a cluster assignment matrix: neighborhoods within fixed hops, sparse neighborhoods retaining only closely related non-reserved nodes, and a sparse soft assignment matrix based on affinity that treats each supernode as the cluster's central node (Gao et al. 2021b).

Selecting both edges and nodes has also been explored. LookHops computes the indices of important nodes and edges independently and simultaneously (Gao et al. 2020), whereas Hierarchical Triplet Attention Pooling (HTAP) selects the important edges only for the important nodes (Bi et al. 2021). Zhou et al. proposed cross-view graph pooling (Co-Pooling), which incorporates pooled representations learned from node and edge views and exchanges the cut proximity matrix and the indices of selected nodes as edge-node view interactions (Zhou and Yin 2023). Accurate Structure-Aware Graph Pooling (ASPool) combines the concepts of clustering pooling, selection pooling, and edge pooling to create coarsened graphs by removing edges to calibrate the graph structure, forming local clusters, scoring and selecting through a two-stage procedure, and merging the selection results (Yu et al. 2022). WGDPool integrates edge weights to enhance graph representations, learning separate node and edge embeddings that converge into a comprehensive graph representation (Xiao et al. 2024).

3.3 Graph unpooling

Graph unpooling serves as the inverse operation of graph pooling, much as deconvolution reverses convolution and upsampling reverses downsampling. These unpooling operators may go by other names, such as the Graph Refining Layer (Hu et al. 2019) or the Up-Sampling Layer (Zhang et al. 2021b). The graph unpooling operator converts the coarsened graph back to the fine graph, performing the upsampling procedure. The resulting U-shaped network structure, which is generally bound to the unpooling operator, enables the network to handle graph-level and node-level tasks concurrently. The unpooling concept was first proposed in Graph U-Nets together with its unpooling operator, gUnpool (Gao and Ji 2019, 2022). Generally, each unpooling operator has two steps: restoring node locations and restoring node representations.

To retrieve the graph to its original structure, the location of the node selected in the corresponding pooling layer needs to be recorded, and the nodes are repositioned using this information. This operation can be formalized as:

$$\begin{array}{c}{{\varvec{U}}}^{l+1}=distribute\left({0}_{{N}^{l+1}\times {d}^{l}},{{\varvec{U}}}^{l},idx\right),\end{array}$$
(20)

where \({\varvec{U}}\) is the restored graph representation matrix, \({0}_{{N}^{l+1}\times {d}^{l}}\) is the restored graph's initial empty representation matrix, and \(distribute({0}_{{N}^{l+1}\times {d}^{l}},{{\varvec{U}}}^{l},idx)\) distributes the row vectors of \({{\varvec{U}}}^{l}\) into the empty matrix according to the indices stored in \(idx\) (Gao and Ji 2019, 2022). As shown in Eq. (20), the indices of selected nodes are stored for unpooling. Currently, the inverses of selection pooling operators are the most popular unpooling operators (Gao and Ji 2019, 2022; Li et al. 2020b; Zhang et al. 2021b; Chen et al. 2022a; Zou et al. 2022; Lu et al. 2022). Furthermore, if the node mapping from the original graph to the coarsened graph is saved during graph pooling, edge pooling can be reversed directly through the inverse mapping, and the restored node features can be calculated through Eq. (21) (Diehl 2019; Yuan et al. 2020). For clustering pooling operators, the mapping relationship is preserved in the cluster assignment matrix \({\varvec{S}}\), which can be utilized to restore the graph structure. In addition, skip connections can be used to improve node representations, as shown in Eq. (22) (Hu et al. 2019).

$$\begin{array}{c}{{\varvec{u}}}_{{v}_{i}}^{l+1}={{\varvec{u}}}_{{v}_{j}}^{l+1}=\frac{{{\varvec{u}}}_{{v}_{ij}}^{l}}{{s}_{i,j}^{l}}\end{array}$$
(21)
$$\begin{array}{c}{{\varvec{U}}}^{l+1}={{\varvec{S}}}^{l}{{\varvec{U}}}^{l}+{{\varvec{Z}}}^{l+1}\end{array}$$
(22)

The main distinction among unpooling operators is how the node representation is restored. After initializing the restored graph's node representation matrix, node representations can be interpolated using a graph convolution layer (Li et al. 2020b). The attention-based graph unpooling (attnUnpool) layer initializes each added node with an attention operator that attends to its neighbors (Gao and Ji 2022). MeanUnPooling, inspired by bilinear interpolation in CNN models, restores node features by averaging the features of neighboring nodes without any hyperparameters (Lu et al. 2022). Zhong et al. perform unpooling by creating a top-down message-passing mechanism that provides the restored nodes with meso-/macro-level knowledge (Zhong et al. 2022). The parameterized unpooling layer (UL), on the other hand, uses an MLP to produce probabilities that determine whether nodes and edges should be restored, as well as to construct features for the restored nodes and edges (Guo et al. 2023).

3.4 Evaluation frameworks and open problems

3.4.1 Evaluation of graph pooling

In this section, we will discuss some ground-breaking research on evaluating pooling operators, including whether and how pooling operators improve graph classification.

3.4.1.1 Benchmark datasets

Table 1 presents benchmark datasets for graph classification, notably the TUDataset with over 120 benchmarks for graph data learning, accessible at www.graphlearning.io (Morris et al. 2020). Popular bioinformatics datasets include D&D, NCI1, NCI109, and others, while social network and collaboration datasets like IMDB-BINARY and COLLAB are also frequently used (Yanardag and Vishwanathan 2015; Kersting et al. 2016; Morris et al. 2020). Accuracy is the primary evaluation metric for classification datasets in the TUDataset. The Open Graph Benchmark offers large-scale datasets across various domains, available at https://ogb.stanford.edu, supporting diverse ML tasks (Hu et al. 2020, 2021c). It includes datasets like Ogbg-molhiv (HIV), Ogbg-ppa (PPA), Ogbg-molbbbp (BBBP), Ogbg-moltox21 (Tox21), and Ogbg-moltoxcast (ToxCast), evaluated using ROC-AUC or accuracy (Gao et al. 2021b; Baek et al. 2021; Jo et al. 2021; Chen et al. 2022b). Additionally, Dwivedi et al. introduced a benchmark framework that encompasses a diverse range of mathematical and practical graphs for fair model comparisons (Dwivedi et al. 2023). Their framework is hosted at https://github.com/graphdeeplearning/benchmarking-gnns.

Table 1 A summary of commonly used datasets
3.4.1.2 Libraries

With the growth of the field, several GNN-specific libraries have emerged, spanning languages like Python and Julia and platforms such as PyTorch, TensorFlow, JAX, and Flux. Table 2 presents a summary of these libraries alongside their supported pooling operators. These libraries, initially designed for simpler node operations, are evolving through recent innovations to offer user-friendly interfaces for diverse GNN models, optimize sparse operations on GPUs, and facilitate scaling to expansive graphs and multi-GPU environments. They provide versatile APIs for hierarchical pooling and readout, with some offering an array of pooling options. Additionally, TUDataset is accessible via PyTorch Geometric (PyG), Deep Graph Library (DGL), and Spektral, while OGB is accessible via PyG and DGL.

Table 2 A summary of libraries
3.4.1.3 Comparison and discussion

Cheung et al. conducted empirical evaluations of various graph pooling methods, including SortPooling (Zhang et al. 2018), DiffPool (Ying et al. 2018), gPool (Gao and Ji 2019), and SAGPool (Lee et al. 2019), within graph classification tasks using GCNs (Cheung et al. 2019). They found that DiffPool outperformed non-pooling networks and the other pooling operators, whereas the performance of the other operators was unstable and gPool performed poorly without the encoder structure. The optimal evaluation protocol for graph pooling, whether a uniform GNN architecture or one tailored to each pooling operator, remains contested (Bodnar et al. 2021). To promote experimental integrity, a standardized and reproducible experimental environment was suggested, which includes nested cross-validation (CV), publicly accessible data splits, and hyper-parameter tuning procedures (Errica et al. 2020). In a controlled experiment, the authors reevaluated five GNNs across nine datasets against structure-agnostic baselines that rely solely on node features and global readouts (Hamilton et al. 2017; Simonovsky and Komodakis 2017; Zhang et al. 2018; Ying et al. 2018; Xu et al. 2019). The findings indicate discrepancies between the actual performance of each model and previously reported results. Notably, graph pooling methods did not consistently outperform the structure-agnostic baselines, despite their purported advantages in leveraging graph structures. Bianchi et al. focused on how pooling operators influence the expressiveness of GNNs and proposed a universal criterion for measuring a pooling operator's effectiveness based on its ability to retain graph information (Bianchi and Lachi 2024). Their experimental evaluation on graph classification benchmarks revealed that expressive pooling operators performed best, and that most sparse pooling methods were not only less effective, due to their limited expressiveness, but also offered no significant speed advantage.

In addition to structure-agnostic baselines and reliable experimental settings, comparative experiments with randomization-based variants serve as another way to validate a model's real effectiveness (Mesquita et al. 2020). Mesquita et al. examined the idea of capturing local information and conducted extensive experiments with such variants, named randomization and complement, on the need for locality-preserving representations (Mesquita et al. 2020). Grattarola et al. offered three evaluation criteria for pooling operators: preservation of (1) node attributes, (2) topological structure, and (3) information for downstream tasks (Grattarola et al. 2022). They applied these criteria using three experimental metrics to evaluate eight pooling methods, focusing on reconstruction of point cloud coordinates, structural similarity between the original and coarsened graphs, and classification performance on benchmark datasets (Monti et al. 2017; Cangea et al. 2018; Ying et al. 2018; Gao and Ji 2019; Lee et al. 2019; Noutahi et al. 2019; Bacciu and Di Sotto 2019; Maria Bianchi et al. 2020; Bianchi et al. 2022b; Grattarola et al. 2022). The findings reveal that trainable methods have an advantage in preserving structure and task-specific information. Furthermore, the authors note that trainable global pooling performs better, which is consistent with our ranking of the learning ability of various types of readout functions. Zhou et al. randomly added and removed edges from a real dataset to test the robustness of existing methods to graph topology in graph classification (Zhou and Yin 2023). Surprisingly, the randomly perturbed edges did not significantly reduce graph classification accuracy, even when all edges were removed (Zhou and Yin 2023). These discoveries have encouraged researchers to conduct more ablation studies to validate the effectiveness of novel pooling operators.

The interpretability of pooling operators enhances our understanding of graph pooling. Visualization of hierarchical clustering has been an attractive way to demonstrate model findings visibly, at least until the explanation of graph pooling and the possible meanings of the captured structures are clearly defined (Ying et al. 2018; Noutahi et al. 2019; Maria Bianchi et al. 2020). CommPOOL generalizes hierarchical graph neural network interpretation to three questions (Tang et al. 2021): How can the hierarchical structures of a graph be captured in an interpretable manner? How can the graph representation be scaled down while preserving the structures using an interpretable process? What results from the pooling operation? To examine the community structure captured by CommPOOL, the authors employed random simulation graphs and protein data with node labels as community ground truth. The Normalized Mutual Information (NMI), Adjusted Mutual Information (AMI), and Adjusted Rand Index (ARI) between the model-predicted community labels and the ground truth are used to quantify the pooling operator's capability to capture the community structure (Maria Bianchi et al. 2020; Zhang et al. 2021b; Roy et al. 2021; Tang et al. 2021; Wang et al. 2022).

3.4.1.4 Evaluation framework

We categorized the evaluation of pooling methods into three tiers: (1) fair experimental settings with repeatable CV, data partitioning, and hyperparameter selection procedures; (2) comprehensive ablation studies and comparison experiments; and (3) heuristic or quantitative model interpretation analysis with theoretical insights. All methods were systematically reviewed and detailed in Table S10 (Supplementary File 1). In general, most methods include a detailed experimental setup; toward an optimal evaluation procedure, the hyperparameter setup or hyperparameter search process should also be detailed in the research settings. For the second tier, it is essential to conduct parameter analyses and comparison experiments with model variations based on candidate techniques. In addition to comparing against variants with specific modules removed, ablation studies may also consider comparisons with variants generated through randomization. Studies validating methods' runtime and memory usage underscore that consistent time and space conditions provide an objective efficiency metric. More efficient feature encoding approaches mean more information may be represented with the same number of neurons or parameters. Convergence behavior is another point of comparison for neural network models. Visualizations, case studies, model explanations, permutation invariance proofs, connections to existing methods, theoretical discourse, and other insightful discussions constitute the third evaluation tier. Popular forms of visualization include examples of selection or cluster structures, cluster visualization of graphs in low-dimensional space, and coloring diagrams of importance scores. Liu et al. and Xiao et al. use image segmentation as an interpretability study to demonstrate the interpretability of node assignment (Liu et al. 2022b; Xiao et al. 2024). Permutation invariance, essential in global pooling, is also demonstrated across various hierarchical pooling operators. Many methods include a time or space complexity analysis as a theoretical assessment of their effectiveness in terms of sparsity or other properties.

3.4.2 Complexity and connectivity

The sparsity of the generated coarsened graph differs significantly between clustering pooling and selection pooling. Differentiable clustering pooling operators often describe the clustering results using a cluster assignment matrix. Because this matrix is typically dense, the adjacency matrix of the generated hypergraph will also be dense, regardless of whether the initial adjacency matrix is sparse. This dense structure imposes unsustainable computing and storage requirements as the size of the input graph rises, preventing the deployment of such pooling methods to large graphs and deeper networks. The updated adjacency matrix is produced in the node selection pooling operators by extracting the initial adjacency matrix, which effectively preserves the sparsity of the graph structure. However, this strategy misses the connectivity among the supernodes, which may result in isolated nodes that are not adjacent to any node in the hypergraph. These isolated nodes may exhibit local extremum-like effects during subsequent message propagation, weakening the validity of node evaluation. Consequently, specific modifications of the graph’s adjacency matrix are required to preserve the graph structure’s sparse properties and robust connectivity.

Diversified sparsification strategies for hypergraph adjacency matrices have been proposed to keep space overhead and complexity acceptable. One sparsification technique is to restrict the number or range of node-to-cluster assignment connections, i.e., local clustering assignment generates a sparse assignment matrix (Xie et al. 2020; Ranjan et al. 2020; Gao et al. 2021b; Hou et al. 2024). Local clustering, in general, means that only nodes in a multi-hop neighborhood can be assigned to the same cluster (Ranjan et al. 2020; Gao et al. 2021b; Yu et al. 2021; Li et al. 2022c). In contrast, in differentiable clustering pooling with no restrictions, any two nodes can be assigned to the same cluster. Meanwhile, the Sparsemax function is used as the normalization function to enable a sparse assignment (Noutahi et al. 2019; Gao et al. 2021b; Zhang et al. 2021b). Sparsemax is a normalization transformation that adaptively sets a threshold for the input vector and maps elements below the threshold to zero after normalization (Martins and Astudillo 2016); it can replace the Softmax function to construct sparse attention mechanisms (Zhang et al. 2021b). As a workaround, Liu et al. use the Gumbel-Softmax to perform soft sampling in the node neighborhood, resulting in a lower edge density for the sampled adjacency matrix in Hierarchical Adaptive Pooling (HAP) (Liu et al. 2021). Liu et al. also use Gumbel-Softmax to convert the soft bridge matrix (i.e., the assignment matrix) into a hard assignment matrix in SMIP (Liu et al. 2022b).
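For reference, Sparsemax has a closed form based on sorting and thresholding (Martins and Astudillo 2016); a NumPy sketch for a 1D score vector:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax of a 1-D score vector (Martins and Astudillo 2016)."""
    z_sorted = np.sort(z)[::-1]                # scores in descending order
    cssv = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    k = ks[1 + ks * z_sorted > cssv].max()     # size of the support
    tau = (cssv[k - 1] - 1.0) / k              # adaptive threshold
    return np.maximum(z - tau, 0.0)            # exact zeros below the threshold

print(sparsemax(np.array([2.0, 1.1, 0.1])))    # [0.95 0.05 0.  ], sums to 1
```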

In selection pooling, the local neighbors of a node can also be utilized to maintain hypergraph connectivity and sparsity (Ma and Chen 2021). In general, if the neighborhoods of two nodes in the original graph are densely interconnected, the two nodes tend to have larger edge weights in the coarsened graph (Sun et al. 2021; Gao et al. 2022a). The distance between a center node and its one-hop neighbors in the original graph can be employed to compute node distances in the coarsened graph, as shown in Eqs. (23 and 24) (Huang et al. 2019). Gao and Ji suggest using the \({2}^{nd}\) graph power to boost graph connectivity, an operation that builds links between nodes separated by no more than two hops (Gao and Ji 2019). Eqs. (25 and 26) describe this strategy. Eq. (27) shows that UGPool uses a similar adjacency matrix update (Qin et al. 2020). Another option is to incorporate graph connectivity into the node evaluation measures, such as the normalized node degree, and to prefer densely connected nodes with better connectivity (Gao et al. 2021a).

$$\begin{array}{c}{{\varvec{A}}}_{sel}^{l}={{\varvec{A}}}^{l}\left(idx,:\right),{{\varvec{A}}}_{sel}^{l}\in {\mathbb{R}}^{{N}^{l+1}\times {N}^{l}}\end{array}$$
(23)
$$\begin{array}{c}{{\varvec{A}}}^{l+1}={{\varvec{A}}}_{sel}^{l}{{\varvec{A}}}^{l}{{{\varvec{A}}}_{sel}^{l}}^{T}\end{array}$$
(24)
$$\begin{array}{c}{{\varvec{A}}}^{2}={{\varvec{A}}}^{l}{{\varvec{A}}}^{l}\end{array}$$
(25)
$$\begin{array}{c}{{\varvec{A}}}^{l+1}={{\varvec{A}}}^{2}\left(idx,idx\right)\end{array}$$
(26)
$$\begin{array}{c}{{\varvec{A}}}^{l+1}={{\varvec{A}}}^{l}\left(idx,idx\right)+{{\varvec{A}}}^{2}\left(idx,idx\right)\end{array}$$
(27)
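A sketch of the graph-power strategy in Eqs. (25)-(27), assuming a dense adjacency matrix; the keep_original flag corresponds to the UGPool variant of Eq. (27).

```python
import numpy as np

def pool_adjacency(A, idx, keep_original=False):
    """Connectivity-preserving adjacency update via the 2nd graph power."""
    A2 = A @ A                # Eq. (25): links nodes at most two hops apart
    sub = np.ix_(idx, idx)
    # Eq. (26) keeps only the power matrix; Eq. (27) adds back one-hop links
    return A[sub] + A2[sub] if keep_original else A2[sub]
```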

Although the abovementioned methods could enhance the graph’s connectivity, they may still result in a few isolated nodes. A more sophisticated method would be to relearn the connections for the coarsened graph and combine this with sparsification to guarantee that the learned adjacency matrix is sparse and well-connected (Zhang et al. 2019a, 2021b; Bianchi et al. 2022b). The structure learning mechanism refines the graph structure and eliminates undesired noise information by using a sparse attention mechanism on the original graph structure (Zhang et al. 2019a, 2021b). The structure learning mechanism consists of the following steps: constructing a single-layer neural network \({\varvec{a}}\) to transform the node representation, calculating the similarity score of two nodes using the attention mechanism \(AM({v}_{i},{v}_{j})\) as shown in Eq. (28), and normalization and sparsification using the \(sparsemax(\cdot )\) function. The adjacency matrix value for node pairs, \({A}_{ij}^{l}\), is integrated into the structure learning layer to give attention to larger similarity scores between directly connected nodes. It also attempts to learn potential pairwise relationships between unconnected nodes for which \(\lambda \) is a trade-off parameter. NDP uses Kron reduction to generate new Laplacian matrices and recover them into adjacency matrices for coarsened graph link construction, as well as a threshold to truncate the adjacency matrix for graph sparsification (Bianchi et al. 2022b).

$$\begin{array}{c}AM({v}_{i},{v}_{j})=\sigma ({{\varvec{a}}}^{T}[{{\varvec{h}}}_{i}^{l}||{{\varvec{h}}}_{j}^{l}])+\lambda {A}_{ij}^{l}\end{array}$$
(28)

3.4.3 Adaptivity

Another issue in hierarchical pooling is adaptivity, which refers to the pooling method's capacity to handle input graphs of varying sizes. Graph classification differs from graph signal processing in that the unclassified graphs may exhibit diverse graph structures and node features. This necessitates the network's capacity to handle different graphs in batches and transform them into fixed-size outputs. Specifically, the pooling operator maps the original graph to a fixed number of supernodes, thus achieving an alignment of the graph structure (Bai et al. 2021; Lee et al. 2021). Adaptivity also relates to the network's classification capacity: how well spatial graph convolution with pooling operators generalizes to graph structures not observed in the training set, where the number of supernodes influences the adaptivity of the GNNs. When the number of supernodes is independent of the size of the input graph, the model extracts the same number of substructures from large and small graphs; when they are positively correlated, complex graphs can be represented by moderately complicated coarsened graphs. The pooling ratio \(r\) is a crucial hyperparameter that, in implementation, reflects the correlation between the size of the supernode set and the original graph. In selection pooling, the pooling ratio is intuitively defined by the number \(k\) of selected nodes in the top-rank function. In clustering pooling methods, it usually corresponds to the number of clusters.

Currently, the most common solution among pooling operators is to determine the number of supernodes by selecting a certain percentage of the maximum number of nodes in the graph data (Cangea et al. 2018; Ying et al. 2018; Gao and Ji 2019; Lee et al. 2019; Ma et al. 2019; Huang et al. 2019; Yuan and Ji 2020; Ranjan et al. 2020; Zhang et al. 2020, 2021b, c; Qin et al. 2020; Li et al. 2020b, a, 2022b; Bandyopadhyay et al. 2020; Gao et al. 2020, 2021b, 2022a, 2022b; Aggarwal and Murty 2021; Liu et al. 2021; Sun et al. 2021; Yang et al. 2021a; Bodnar et al. 2021; Pang et al. 2021b; Yu et al. 2021, 2022; Pham et al. 2021; Bi et al. 2021; Tang et al. 2021; Su et al. 2021; Wang et al. 2022; Xu et al. 2022; Duan et al. 2022; Zhou and Yin 2023). In earlier methods, the number of supernodes was constant for all graphs, might not be shared across multiple layers, and could imply varying pooling ratios per graph (Ma et al. 2020; Maria Bianchi et al. 2020; Khasahmadi et al. 2020; Xie et al. 2020). Some pooling methods, such as edge contraction pooling (Diehl et al. 2019) and node decimation pooling (Bianchi et al. 2022b), use a fixed pooling ratio, generating a coarsened graph with 50% of the nodes at a time. Other hyperparameters that impact the adaptivity of graph pooling operators appear in certain topology- or heuristic-rule-based methods, such as the number of pooling layers in SEP (Wu et al. 2022) and the maximum number of missing links for nodes in a clique in KPlexPool (Bacciu et al. 2021).

The pooling operators with optimal adaptivity should determine the number of supernodes for each sample based on its unique structure and features rather than artificially pre-selecting the number of supernodes for all samples. Inspired by graph reduction algorithms, the graph coarsening layer in the Hierarchical Graph Convolutional Network (H-GCN) merges nodes into structural equivalence groupings and structural similarity groupings until all nodes are marked, and subsequently constructs a cluster assignment matrix to elucidate the merging process (Hu et al. 2019). AdamGNN's adaptive graph pooling (AGP) operator merges low-diameter ego-networks adaptively and recursively to construct supernodes that contain these ego-networks (Zhong et al. 2022). Leveraging iterative adaptive community detection algorithms, such as the Louvain algorithm, is another strategy for fully adaptive graph pooling operators (Roy et al. 2021). To tackle the challenge of prior knowledge in node sampling, Sun et al. propose a novel reinforcement learning (RL) algorithm for adaptively updating the pooling ratio \(r\) (Sun et al. 2021). When adapted to deep learning, the maximal independent vertex set (MIVS) and maximum weight independent set (MWIS) algorithms allow for node selection without a predetermined ratio (Nouranizadeh et al. 2021; Stanovic Stevan and Gaüzère 2022). A more adaptable and practical solution involves using a threshold to dynamically select the optimal number of supernodes for each sample. Noutahi et al. provide an alternative in Laplacian Pooling (LaPool), which dynamically selects nodes with stronger signal variation than their neighbors, offering the unique flexibility of defining clusters dynamically when training graphs sequentially (Noutahi et al. 2019). More generally, when using pooling operators that rely on evaluation values to pick nodes, it is advisable to assign a threshold \(\widetilde{s}\) such that only nodes with evaluation value \({s}_{v}>\widetilde{s}\) are preserved (Knyazev et al. 2019).
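A minimal sketch of this threshold rule; the fallback for samples where no node clears the threshold is our own guard, not prescribed by the cited work.

```python
import numpy as np

def threshold_select(s, s_tilde):
    """Keep nodes whose evaluation score exceeds s_tilde, so the number of
    supernodes adapts to each sample (Knyazev et al. 2019)."""
    idx = np.flatnonzero(s > s_tilde)
    # guard (our assumption): never return an empty coarsened graph
    return idx if idx.size else np.array([int(np.argmax(s))])
```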

3.4.4 Additional loss

The classification loss, such as the cross-entropy loss, is the most popular loss function for graph pooling operators. Nevertheless, when the training objective involves extra constraints or the model faces convergence challenges, incorporating additional loss functions becomes essential. DiffPool (Ying et al. 2018) defines the link prediction objective and entropy regularization as auxiliary loss functions, as shown in:

$$\begin{array}{c}{loss}_{LP}={\Vert {{\varvec{A}}}^{l}-{{\varvec{S}}}^{l}{{{\varvec{S}}}^{l}}^{T}\Vert }_{F},\end{array}$$
(29)
$$\begin{array}{c}{loss}_{E}=\frac{1}{{N}^{l}}\sum\limits_{i=1}^{{N}^{l}}{EF({\varvec{S}}}_{i}^{l}),\end{array}$$
(30)

where \({\Vert \cdot \Vert }_{F}\) denotes the Frobenius norm and \(EF(\cdot )\) denotes the entropy function. The link prediction objective follows the intuition that nearby nodes should be pooled together. The entropy regularization highlights another important characteristic of pooling GNNs: ensuring that each node's cluster assignment closely resembles a one-hot vector, thus clearly defining each cluster or subgraph membership. Other differentiable pooling operators that share similar architectures also incorporate these objectives to aid training (Pham et al. 2021; Gao et al. 2021a). To boost the training process, AttPool attaches an MLP to each pooling layer, which takes the graph embedding as input and predicts the graph labels (Huang et al. 2019). The losses and predictions at different levels are aggregated to obtain the total classification loss and the final prediction. Su et al. also designed a pooling information loss to keep node representation distributions as consistent as possible before and after pooling, as seen below (Su et al. 2021):

$$\begin{array}{c}{loss}_{stability}={\Vert {{{\varvec{H}}}^{l}}^{T}{{\varvec{H}}}^{l}-{{{\varvec{H}}}^{l+1}}^{T}{{\varvec{H}}}^{l+1}\Vert }_{F}.\end{array}$$
(31)
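The three auxiliary objectives reduce to a few matrix norms; a hedged NumPy sketch, where the clipping inside the entropy term is our numerical guard:

```python
import numpy as np

def auxiliary_losses(A, S, H, H_next):
    """Sketch of Eqs. (29)-(31): link prediction, entropy, and stability."""
    # Eq. (29): nearby nodes should be pooled together
    loss_lp = np.linalg.norm(A - S @ S.T, ord='fro')
    # Eq. (30): push each row of S toward a one-hot cluster assignment
    P = np.clip(S, 1e-12, 1.0)
    loss_e = (-P * np.log(P)).sum(axis=1).mean()
    # Eq. (31): keep feature co-variation consistent across pooling
    loss_stab = np.linalg.norm(H.T @ H - H_next.T @ H_next, ord='fro')
    return loss_lp, loss_e, loss_stab
```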

When the pooling operator introduces more strategies for extracting features, the additional loss functions become more diversified. Since StructPool learns cluster assignment relations via conditional random fields, finding the optimal assignment is equivalent to minimizing the Gibbs energy (Yuan and Ji 2020). Pooling methods based on graph capsule networks usually use a margin loss to calculate the classification loss and a reconstruction loss to constrain the capsule reconstruction to closely match the class-conditional distribution (Xinyi and Chen 2018; Yang et al. 2021a). Maximizing mutual information is another popular auxiliary training target, typically implemented with mutual information neural estimation (Li et al. 2020b; Bandyopadhyay et al. 2020; Sun et al. 2021; Pang et al. 2021b; Roy et al. 2021; Lee et al. 2021). Additional losses, including the matrix decomposition loss (Bacciu and Di Sotto 2019), the reconstruction loss (Zhong et al. 2022), the Kullback–Leibler (KL) loss (Knyazev et al. 2019; Khasahmadi et al. 2020; Tang et al. 2021), the deep clustering embedding loss (Bi et al. 2021), and the spectral clustering loss (Maria Bianchi et al. 2020), are employed as training objectives to enhance the model's representational capacity. WGDPool utilizes a differentiable k-means clustering method and a multi-term parameterized loss function including cut, orthogonality, clustering, and reconstruction losses (Xiao et al. 2024).

3.4.5 Attention mechanisms

Attention, a quantitative measure for edges and nodes, has found extensive utility in GNNs and graph pooling operators (Vaswani et al. 2017; Knyazev et al. 2019). Existing pooling methods have integrated various attention mechanisms to meet diverse design requirements. GMT captures node interactions using a multi-head attention mechanism with queries \(\mathcal{Q}\in {\mathbb{R}}^{ N\times {d}_{q}}\), keys \(\mathcal{K}\in {\mathbb{R}}^{ N\times {d}_{k}}\), and values \(\mathcal{V} \in {\mathbb{R}}^{ N\times {d}_{v}}\) as inputs and \(Att(\mathcal{Q},\mathcal{K},\mathcal{V})=w(\mathcal{Q}{\mathcal{K}}^{T})\mathcal{V}\), where \(w(\cdot )\) is an activation function (Vaswani et al. 2017; Baek et al. 2021). Gao and Ji use linear transformations to generate the key and value matrices for the attention operator, and they also create a gate vector that regulates the information flow of node features (Gao and Ji 2022). MSAPool uses multi-head attention to discover task-relevant parts of the input data and learn each node's global significance after pooling (Xu et al. 2022). In Region and Relation based Pooling (R2POOL), the dot-product rule of self-attention is utilized to calculate the similarity of each query vector with each key vector, identifying bi-directional pairwise node similarities and relative significance at the graph scale (Aggarwal and Murty 2021).

In general, the attention weight between two nodes is generated as follows:

$$\begin{array}{c}{\alpha }_{i,j}=Softmax(\sigma ({{\varvec{a}}}^{T}[W{{\varvec{h}}}_{i}||W{{\varvec{h}}}_{j}])).\end{array}$$
(32)

These attention scores evaluate the similarity or correlation between node pairs. When only one node is taken into account, the attention weight of a node can be calculated as follows:

$$\begin{array}{c}{\alpha }_{i}=Softmax\left(\sigma \left({{\varvec{a}}}^{T}\left[{\varvec{W}}{{\varvec{h}}}_{i}\right]\right)\right),\end{array}$$
(33)

where \(\sigma (\cdot )\) could be the Sigmoid function or LeakyReLU function (Xinyi and Chen 2018; Li et al. 2019; Bi et al. 2021). Many pooling methods use an attention mechanism, which can be implemented by concatenating the representations of two nodes or different representations of one node, and then feeding it into a feedforward neural network (Xinyi and Chen 2018; Fan et al. 2020; Ranjan et al. 2020; Liu et al. 2021; Sun et al. 2021; Zhang et al. 2021b; Yu et al. 2021, 2022; Bi et al. 2021; Itoh et al. 2022; Li et al. 2022c). A simplified version is to use a projection vector to map the node representation to the attention score directly (Huang et al. 2019; Yuan and Ji 2020; Gao et al. 2020; Su et al. 2021; Lu et al. 2022; Wang and Ji 2023). Methods that compute attention scores solely based on node representations may not fully exploit the local structural information of nodes, thus some methods propose using GNNs to calculate node attention scores (Lee et al. 2019; Meltzer et al. 2019; Aggarwal and Murty 2021; Pang et al. 2021b; Duan et al. 2022). CGIPool computes a 1D attention score vector for each node in the input graph using parallel graph neural networks: \(Att({{\varvec{H}}}^{l},{{\varvec{A}}}^{l})=\sigma (GNN({{\varvec{H}}}^{l},{{\varvec{A}}}^{l}))\) (Pang et al. 2021b). For subgraph matching, H2MN (Hierarchical Hypergraph Matching Networks) calculates cross-graph attention coefficients among the hyperedges between graph pairs using cosine similarity scores (Zhang et al. 2021c). Similarly, LaPool learns the node-to-cluster assignment matrix via a soft-attention method measured by cosine similarity (Noutahi et al. 2019).
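A sketch of the pairwise form in Eq. (32), with LeakyReLU as the activation; the helper names are illustrative.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def pairwise_attention(H, edges, W, a):
    """Attention weights over node pairs, Eq. (32).
    W: (d2, d) shared transform; a: (2 * d2,) scoring vector."""
    logits = np.array([leaky_relu(a @ np.concatenate([W @ H[i], W @ H[j]]))
                       for i, j in edges])
    e = np.exp(logits - logits.max())
    return e / e.sum()        # softmax normalization over the scored pairs
```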

In addition to the weighted summation-based readout functions described in Sect. 3.1.3, the attention mechanism is also used to compute the cluster assignment between nodes and supernodes in clustering pooling operators. Based on Memory Augmented Neural Networks (MANNs), the Graph Memory Network (GMN) uses the clustering-friendly Student's t-distribution to measure the normalized similarity for query-key pairs between nodes and clusters as the soft assignment probabilities (Khasahmadi et al. 2020). HAP, on the other hand, employs an attention method similar to Graph Attention Networks (GATs), in which pre-generated global cluster content vectors and node content vectors are concatenated and fed into a nonlinear network to learn the Master-Orthogonal-Attention (MOA) scores (Veličković et al. 2018; Liu et al. 2021). There are two popular ways of computing the attention mechanism in selection pooling operators: linear mapping (\({\varvec{\mu}}\)) of individual node features and GNN approaches based on nodes and node neighbors, as shown in Eqs. (34 and 35) (Knyazev et al. 2019). The former is the scalar projection and can be extended to an MLP with a non-linear function \(\sigma (\cdot )\) as in Eqs. (36 or 33) (Zhang et al. 2021b), or to multi-head attention with mapping matrix \(\boldsymbol{\rm M}\) (Bi et al. 2021), whereas the latter is used by selection pooling operators including SAGPool (Lee et al. 2019), soft-mask GNN (SMG) (Yang et al. 2021b), and R2POOL (Aggarwal and Murty 2021). All of these approaches concentrate on a single node, and most of them adopt only one feature perspective. As a result, there is a trend toward combining increasingly sophisticated attention mechanisms, such as node-to-node attention (Duan et al. 2022) and key-value-query-based attention (Aggarwal and Murty 2021; Gao and Ji 2022).

$$\begin{array}{c}{{\varvec{s}}}_{att}=H\mu ,\mu \in {\mathbb{R}}^{d}\end{array}$$
(34)
$$\begin{array}{c}{{\varvec{s}}}_{att}=GNN\left({\varvec{A}},{\varvec{H}}\right)\end{array}$$
(35)
$$\begin{array}{c}{{\varvec{s}}}_{att}=MLP\left({\varvec{H}}\right)=\sigma \left({\varvec{H}}{\varvec{\mu}}\right)\end{array}$$
(36)

In summary, the attention mechanism in graph pooling operators is used to compute the assignment matrix of nodes and clusters (Xinyi and Chen 2018; Noutahi et al. 2019; Huang et al. 2019; Khasahmadi et al. 2020; Ranjan et al. 2020; Su et al. 2021; Baek et al. 2021; Liu et al. 2021; Sun et al. 2021; Yu et al. 2021; Li et al. 2022c), the local or global importance scores used to select nodes (Lee et al. 2019; Huang et al. 2019; Gao et al. 2020; Aggarwal and Murty 2021; Pang et al. 2021b; Duan et al. 2022; Gao and Ji 2022), the similarity or correlation between nodes used to learn the graph structure (Yuan and Ji 2020; Zhang et al. 2021b; Bi et al. 2021), or as a gating mechanism that controls how the information of individual nodes is integrated (Meltzer et al. 2019; Fan et al. 2020; Baek et al. 2021; Yu et al. 2021, 2022; Zhang et al. 2021c; Xu et al. 2022; Itoh et al. 2022; Lu et al. 2022; Duan et al. 2022; Li et al. 2022c; Wang and Ji 2023). Yet the breadth of applications has led researchers to incorporate attention mechanisms almost by default when designing novel graph pooling operators, without examining their applicability and actual benefit. Knyazev et al. showcased the considerable potential of learned attention in GNNs and graph pooling layers, provided that it closely approximates optimality; attaining this proximity, however, can be challenging due to sensitivity to initialization (Knyazev et al. 2019). The observation that attention can be negligible or even harmful under typical conditions suggests that the attention mechanism remains a significant open problem in graph pooling. Further development of our understanding of attention, together with substantiating its effectiveness through rigorous, fair, and comparable experiments, is imperative.

4 Applications in omics studies

Biological networks and molecular structures are two well-established graph modeling topics in bioinformatics data analysis (Zhang et al. 2021a). When modeling a molecular structure as a graph, atoms or chemical substructures are usually treated as nodes, while bonds are treated as edges. In biological network modeling, the modeled entities are usually utilized as nodes, and the edges connecting nodes indicate known associations between pairs of entities. Unlike molecular structures, which have innate graph structures, biological networks require extracting entities and modeling the interactions between them. The entities represented by biological network nodes cover molecular compounds, biomolecules, cells, and tissues, and the data used for modeling graphs across various omics range from chemical structures to sequencing, expression, and medical images (Jin et al. 2021). GNNs can be used to formulate and aggregate entity relationships in the graph data structure, and the graph embedding obtained by pooling the learned representations of all nodes can serve as a robust low-dimensional feature that preserves topological relationships between entities in biological networks (Wang et al. 2020b, 2021a). The bioinformatics applications of graph pooling operators in omics covered in this survey can be divided into three categories: genomics (Sect. 4.1), radiomics and other medical imaging (Sect. 4.2), and proteomics (Sect. 4.3). We conducted a selective review of applications on these omics data, with the remainder in other omics, such as metabolomics or multi-omics, presented on GitHub.

4.1 Genomics

Data: The advent of high-throughput next-generation genomic technologies has catalyzed a surge in genomics data, propelling studies of DNA that encompass its structure, modification, and expression. Extensively studied biological networks involving genes, such as PPI, GRN, co-expression or correlation networks, disease networks, and other multi-omics networks, are often developed from cohort studies that include multiple patients (Sulaimany et al. 2018). To obtain graphs that represent individual states, an intuitive approach is to augment gene networks derived from a population with individual characteristics as graph signals (Ramirez et al. 2020, 2021; Chereda et al. 2021). Pfeifer et al. incorporated gene expression and DNA methylation data as node features within a shared PPI network (Pfeifer et al. 2022). Despite identical graph topologies for all patients, the variable node features distinctively represent each patient's unique cancer molecular profile. A Single-Sample Network (SSN) is a biomolecular network tailored from individual data and a reference set to delineate a person's unique disease condition, providing insights into the personal response to pathophysiological changes (Liu et al. 2016). LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples) represents an alternative method for constructing single-sample networks, in which the edge scores of an individual network are calculated as the difference between the edge scores of an aggregate network that includes all samples and those of a network reconstructed without the sample of interest (Kuijjer et al. 2019). Single-sample network methods yield gene networks with unique topologies tailored to each individual sample. Single-cell technologies have generated extensive omics data at the cell level, shifting the understanding of disease from the individual or tissue level toward the cell level and deepening insights into heterogeneity between cells. Constructing sample-specific graph signals on background networks (Wang et al. 2021b) or single-sample networks via sample-specific network construction methods (Dai et al. 2019) is also applicable to single-cell data.

Tasks: The main goal of graph pooling in genomics networks is to improve classification, including stage, survival, grade and subtype classification. Additionally, such studies often employ model interpretation methods to identify cancer driver genes, network biomarkers or genes associated with diseases. In graph-level gene network analysis, the workflow includes building individual networks, crafting supervised GNN-based classification models, interpreting results, and assessing biological relevance. Given that single-cell data often encompasses multiple omics datasets, this involves tasks such as modality prediction, modality matching, and joint embedding (Wen et al. 2022). Graph-level single-cell network analysis involves both supervised tasks, like cell classification, and unsupervised tasks, such as cell clustering (Wang et al. 2021b; Hu et al. 2024).

Methods: Table 3 shows representative methods with graph pooling operators for processing biological networks from genomics. Ramirez et al. designed an edge pooling method that greedily selects pairs of nodes to merge through feature summation (Ramirez et al. 2020, 2021). Chereda et al. employed one-dimensional (1D) pooling from CNNs, such as max pooling or average pooling, to process graph embedding matrices and cluster multiple neighboring nodes into a supernode (Chereda et al. 2021). sigGCN employs a max-pooling layer to cluster a specified number of nodes into a supernode (Wang et al. 2021b). Hou et al. developed a truncated differentiable clustering pooling operator that efficiently condenses the cluster assignment matrix, creating finite gene clusters without overlap (Hou et al. 2024). Hu et al. employed differentiable clustering pooling to first identify and then align tissue cellular neighborhoods (TCNs) across spatial maps, generating an embedding that maintains the integrity of the TCN partition information (Hu et al. 2024). Liang et al. leveraged SAGPool on pathway-based gene networks for cancer prognosis, utilizing interpretive algorithms to pinpoint survival-related pathways (Liang et al. 2022).

Table 3 Representative methods for processing biological networks from genomics

Discussion: The most frequently used graph pooling in genomics networks is clustering pooling within hierarchical pooling, followed by global pooling operators. Ramirez's research pioneered a data-driven model utilizing GNNs for classifying cancer subtypes (Ramirez et al. 2020). The model was trained on 11,071 samples spanning 33 cancer types and four distinct network types, achieving a cancer subtype prediction accuracy that surpasses or matches previously documented ML algorithms. Wang et al. evaluated sigGCN against four traditional ML methods over seven datasets; their findings revealed equivalence on smaller datasets, while sigGCN outperformed the traditional ML methods on larger, more complex datasets (Wang et al. 2021b). Ramirez et al. applied the same model to cancer subtype classification and survival prediction tasks, demonstrating the scalability of pooling operators across different datasets and tasks (Ramirez et al. 2020, 2021). Considering that neural network decisions require explanation prior to clinical consideration, nearly all genomics studies now incorporate interpretative analysis to provide both model insights and clinically relevant information (Chereda et al. 2021; Liang et al. 2022). Additionally, Hou et al. used different modules for clustering and classification, while Hu et al. employed similar architectures for multiple tasks, showcasing the potential of GNNs with pooling operators for multi-task learning (Hou et al. 2024; Hu et al. 2024). However, these studies rarely consider leveraging auxiliary information from related tasks, thus failing to enable mutual task enhancement. In genomics research, a crucial factor is the single-sample network construction method. Most current methods are applied to homogeneous graphs, with limited focus on heterogeneous data. Li et al. aimed to adapt clustering pooling for heterogeneous graphs, first dimensionally aligning heterogeneous nodes via a linear layer and then clustering multiple nodes into a supernode on homogeneous graphs (Li and Nabavi 2024). However, there is still a lack of extensive experimental evaluation of various pooling operators across multiple data types, graph construction methods, and tasks.

4.2 Radiomics

Data: Radiomics involves the rapid collection of extensive medical images from technologies like computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET) (Li et al. 2022a). Antonelli et al. introduced 'omics imaging' to describe the combination of biomedical imaging with omics data, analyzing both histological and radiomic images (Antonelli et al. 2019). Medical images are gold standards in medical practice for the diagnosis, classification, and prognosis of clinical diseases. They offer a wealth of data on biological entities and their interconnections, making it possible to extract graphs using diverse approaches. Brain connectivity networks, for example, map the brain's graph structure, while tissue images facilitate the creation of cellular or tissue entity graphs. By representing the brain as a functional connectivity graph, functional magnetic resonance imaging (fMRI) has made significant progress in comprehending the brain. In this immensely complex system, nodes are defined as voxels or brain regions of interest (ROIs), and edges are defined as functional connectivity between those ROIs, usually calculated as pairwise correlations of fMRI time series (Li et al. 2021b). Medical images of other tissue regions, such as histopathology images based on cell or tissue structure patterns (Adnan et al. 2020; Martin-Gonzalez Paula and Crispin-Ortuzar 2021; DI et al. 2022; Pati et al. 2022; Gao et al. 2022c), and CT and MRI of other lesion sites (Huang et al. 2021; Pang et al. 2021a), can also be constructed into graph structures. In such graphs, a node represents a cell or an image region, and the edges are defined by spatial distance. The rapid digitization of pathology slides into high-resolution whole-slide images (WSIs), facilitated by whole-slide scanning systems, has ushered in a revolution in computer-aided diagnostics. Building WSI networks involves detecting entities at tissue or cellular levels, patch-based sampling, or dynamically forming networks from CNN-extracted higher-order features (DI et al. 2022; Pati et al. 2022; Gao et al. 2022c). Single-sample network methods like SSN and LIONESS are also suitable for building networks from WSIs' high-order features, capturing the intricate interactions within (Kuijjer et al. 2019; Duroux et al. 2023).

Tasks: The main goal of using graph pooling in radiomics networks is to improve classification, with labels derived from clinical data. Additionally, graph pooling is expected to link the classification results with the pathological annotations and clinical features, thereby offering pathological explanations and clinical insights for each sample. Understanding mental diseases such as schizophrenia and Alzheimer’s disease is an essential objective of applying GNNs to brain connectivity networks, hence one basic task of GNNs with graph pooling is disease diagnosis, i.e., separating controls and cases via brain networks (Li et al. 2020c, 2021b; Sebenius et al. 2021; Hu et al. 2021a; Gopinath et al. 2022; Song et al. 2022; Zhao et al. 2022). Another objective is to efficiently differentiate and explain brain states through model explanation, which helps to further research into the brain’s operating mechanisms (Li et al. 2021b; Gopinath et al. 2022; Gao et al. 2022b; Zhang et al. 2023b). In general, the overall process of brain network analysis involves brain map construction, GNN-based classification, and possible subsequent analysis, such as ROI extraction and visualization (Li et al. 2021b). This advancement of medical imaging, in particular the WSIs, has enabled the application of artificial intelligence to address a variety of pathology tasks, encompassing tumor detection, tumor staging, and survival analysis (Adnan et al. 2020; Martin-Gonzalez Paula and Crispin-Ortuzar 2021; DI et al. 2022; Pati et al. 2022; Gao et al. 2022c). The two-stage pipeline of GNN on WSI generally starts from building the network and extracting features and then building the GNN for classification or regression prediction. Diverging from supervised tasks, Zheng et al. and Özen et al. concentrate on the unsupervised GNN-Hash for graph encoding, establishing the retrieval indexes of WSIs to enhance ROI retrieval for auxiliary diagnosis (Zheng et al. 2019; Özen et al. 2020). Wang et al. focus on weakly supervised learning, employing image-level labels instead of pixel-level annotations to aid pathologists with WSI Gleason grading (Wang et al. 2020a).

Methods: Table 4 shows representative methods with graph pooling operators for processing biological networks from medical images. The most frequently used method for medical image network classification is selection pooling among hierarchical pooling operators, followed by attention-based global pooling operators. To investigate human brain states, Zhang et al. propose a novel domain-knowledge-informed self-attention graph pooling-based graph convolutional neural network (DH-SAGPool) that retains significant nodes by computing a score for each node in the graph as \(\mathbf{s}=\sigma(\mathbf{A}\mathbf{H}\mathbf{W})\) (Zhang et al. 2023b). Tang et al. propose a Hierarchical Signed Graph Pooling (HGP) module consisting of four steps: (1) calculation of Information Scores (ISs); (2) selection of top-K informative hubs; (3) feature aggregation; and (4) graph pooling, where the IS contains balanced and unbalanced components to measure information from balanced and unbalanced node sets (Tang et al. 2022). Focusing on the contribution of each node to the final prediction, all nodes in the GAT-LI model share a mapping vector that assigns weights to node representations, and existing methods such as GNNExplainer are integrated to understand the model and highlight important features (Ying et al. 2019; Hu et al. 2021a). Li et al. propose a maximum mean discrepancy (MMD) loss and a binary cross entropy (BCE) loss to accentuate the distinction between the ranking scores of rejected and retained nodes, overcoming a limitation of existing approaches in which these scores may not be discernible (Li et al. 2020c). Moreover, a group-level consistency (GLC) loss is designed, and commonly used, to force the pooling operator to select similar significant nodes or ROIs across different input instances (Li et al. 2020c, 2021b; Gao et al. 2022b).
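
To make the selection-pooling mechanics concrete, the following is a minimal sketch of score-and-select pooling in the spirit of the formula above, written for a dense adjacency matrix in PyTorch. The function name, the placement of the sigmoid, and the score-gating step follow common SAGPool-style practice and are illustrative, not a reproduction of DH-SAGPool.

```python
import torch

def selection_pool(adj, h, w, ratio=0.5):
    """Score-and-select pooling: s = sigma(A H W), then keep the top-k nodes.

    adj: (n, n) dense adjacency; h: (n, d) node features;
    w: (d, 1) learnable projection; ratio: fraction of nodes kept.
    """
    scores = torch.sigmoid(adj @ h @ w).squeeze(-1)  # one score per node
    k = max(1, int(ratio * h.size(0)))
    top_scores, idx = torch.topk(scores, k)          # indices of kept nodes
    h_pool = h[idx] * top_scores.unsqueeze(-1)       # gate features by score
    adj_pool = adj[idx][:, idx]                      # induced subgraph
    return h_pool, adj_pool, idx
```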

Table 4 Representative methods for processing biological networks from radiomics

Various GNN architectures have shown considerable applicability in medical image network classification. Inspired by DiffPool, Gopinath et al. propose an end-to-end learnable pooling strategy for the subject-specific aggregation of cortical features (Gopinath et al. 2022). This model is structured hierarchically, with sequential Graph Convolution + Pooling (GC + P) blocks followed by two FC layers. In addition, UGPool utilizes the JK-net architecture for graph classification in brain connectivity experiments (Qin et al. 2020). Notably, an architecture intermediate between hierarchical and flat structures is often used for medical image classification; it usually consists of several GCN layers sequentially linked with a hierarchical pooling layer and a global pooling layer, as shown in Fig. 4. In this architecture, a single hierarchical pooling operator, typically a selection pooling operator, is utilized to assess important nodes. The importance coefficients of nodes represent the preferences of the model and provide insights into the diverse roles played by the different regions represented by nodes during classification.
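
A minimal end-to-end sketch of this intermediate architecture is given below, reusing the score-and-select idea from the previous listing; all layer sizes, the mean readout, and the class names are illustrative assumptions rather than the design of any surveyed model.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Dense GCN-style layer: H' = ReLU(A H W)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, adj, h):
        return torch.relu(self.lin(adj @ h))

class MedicalImageGNN(nn.Module):
    """Fig. 4 pattern: stacked GCN layers, one selection pooling layer,
    then a global (mean) readout feeding a classifier head."""
    def __init__(self, d_in, d_hid, n_classes, ratio=0.5):
        super().__init__()
        self.gc1 = GCNLayer(d_in, d_hid)
        self.gc2 = GCNLayer(d_hid, d_hid)
        self.score = nn.Linear(d_hid, 1)  # node scoring for selection pooling
        self.ratio = ratio
        self.clf = nn.Linear(d_hid, n_classes)

    def forward(self, adj, h):
        h = self.gc2(adj, self.gc1(adj, h))
        s = torch.sigmoid(self.score(h)).squeeze(-1)
        k = max(1, int(self.ratio * h.size(0)))
        _, idx = torch.topk(s, k)                   # keep the top-scoring nodes
        h, adj = h[idx] * s[idx].unsqueeze(-1), adj[idx][:, idx]
        g = h.mean(dim=0)                           # global pooling to one vector
        return self.clf(g), idx                     # logits and key-node indices
```

Returning the kept indices alongside the logits mirrors how such models expose important ROIs for interpretation.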

Fig. 4 GNN architectures for medical images

Discussion: Graph pooling operators are important in medical image analysis at the following levels: global pooling operators are used to obtain a representation of the graph for classification (Adnan et al. 2020; Huang et al. 2021; Hu et al. 2021a; DI et al. 2022; Pati et al. 2022; Gao et al. 2022c); hierarchical pooling operators are used to extract features at different levels (Qin et al. 2020; Huang et al. 2021; Pang et al. 2021a; Tang et al. 2022; Gopinath et al. 2022; Song et al. 2022); and attention-based global pooling and selection pooling are used to evaluate the contribution of nodes to the classification in order to identify important nodes (Li et al. 2020c, 2021b; Adnan et al. 2020; Martin-Gonzalez Paula and Crispin-Ortuzar 2021; Sebenius et al. 2021; Hu et al. 2021a; Tang et al. 2022; Song et al. 2022; Gao et al. 2022b; Zhao et al. 2022). CGC-Net employs DiffPool as its graph clustering module, visualizing the cell graphs after clustering (Zhou et al. 2019). The first advantage of graph pooling operators lies in their outstanding prediction capabilities across various data tasks. DH-SAGPool, for instance, has demonstrated classification performance surpassing existing methods in brain state classification experiments ranging from binary to seven-class tasks (Zhang et al. 2023b). The effect of pooling in brain connectivity network tasks is sensitive to hyperparameters: keeping the pooling ratio below a certain threshold reduces the model’s parameter count, increases the pooling operator’s robustness to noise, and performs effective dimensionality reduction (Sebenius et al. 2021). However, an extremely low pooling ratio can result in severe information loss. Another advantage of pooling operators lies in providing interpretability to the model, thereby inspiring interpretations of specific biological or medical questions. The node selection pooling layers give BrainGNN intrinsic interpretability, enabling the discovery of significant brain regions that contribute informative features to the prediction task at different levels (Li et al. 2021b).

Despite its widespread use as an intuitive method for identifying key nodes in model interpretation, selection pooling falls short in quantitative evaluation and theoretical support (Li et al. 2021b). Another challenge faced by graph pooling operators when attempting to understand specific medical or biological problems is their domain-agnostic nature: the explanations provided by computational models may not be equivalent to interpretations in a medical or biological context (Karim et al. 2023). Intuitively, the nodes or features that contribute most to the model’s classification may not include biologically relevant factors or exhibit a strong correlation with expected biological effects (Hu et al. 2021a). The correlation between these two types of explanations often lacks sufficient experimental or theoretical support. A trend is emerging to enhance this correlation by incorporating extensive domain knowledge into data processing and model computation (Zhang et al. 2023b; Karim et al. 2023). Clustering-based hierarchical pooling is less frequently used in medical image analysis, although clustering pooling on voxels is crucial to constructing adaptive network topologies in SpineParseNet (Zhou et al. 2019; Pang et al. 2021a). On the other hand, BrainGNN reconciles node community partition and key node extraction by integrating structure-aware GCN and node selection graph pooling operators, offering insights for other applications (Li et al. 2021b).

4.3 Proteomics

Data and Tasks: Proteomics is the comprehensive analysis of the proteome, encompassing the structure, abundance, function, modifications, localization, and interactions of proteins (Li et al. 2022a). One of the most important tasks in computational drug discovery is protein–ligand binding affinity prediction. In drug discovery, ligands often refer to drug candidates, encompassing small molecules and biologics, which act as agonists or inhibitors in biological processes with the potential to treat diseases. Binding affinity, which quantifies the strength of the interaction between a protein and a ligand, is typically determined through rigorous and time-intensive experimental procedures. Proteins and ligands can be described as graphs based on their structures, with atoms typically serving as nodes and the edges between nodes defined by interactions. Owing to their multi-level structure, proteins can be represented as graphs at multiple levels. For instance, Nikolaienko et al. utilized receptor secondary structures as the nodes of protein graphs (Nikolaienko et al. 2022). In general, ligands are small molecules; when the ligand is also a protein, this computational task is comparable to PPI prediction (Réau et al. 2023; Huang et al. 2023).
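
A common concrete realization of such structure graphs is the distance-based contact graph, sketched below; the 5 Å cutoff and the function name are illustrative defaults, not parameters taken from any cited model.

```python
import numpy as np

def contact_graph(coords, cutoff=5.0):
    """Build a structure graph from 3D coordinates.

    coords: (n, 3) positions (e.g., atoms or residue centroids) in
    angstroms. An edge connects every pair closer than `cutoff`, one
    common way to define interaction edges in protein/ligand graphs.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # (n, n) pairwise distances
    adj = (dist < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)             # drop self-edges
    return adj
```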

Methods: Table 5 shows representative methods with graph pooling operators for molecular structures. The expansion of three-dimensional (3D) protein structure data and advancements in structural biology have facilitated the establishment of a conceptual framework for structure-based drug discovery (Senior et al. 2020; Jumper et al. 2021; Zhu et al. 2022). Li et al. propose the Structure-aware Interactive Graph Neural Network (SIGN) to learn from the constructed complex graph for predicting protein–ligand binding affinity; it uses pairwise interactive pooling (PiPool) to calculate the interaction matrix between different types of atoms in proteins and their ligands, leveraging long-range interactions (Li et al. 2021a). PSG-BAR (Protein Structure Graph-Binding Affinity Regression) computes cross-attention between protein node and ligand representations and employs attention-based global pooling to compute a virtual node representation for prediction (Pandey et al. 2022). Li et al. employ a linear layer to learn assignment weights for each node, serving as an attentive pooling layer (APL) that learns hierarchical structures and compresses the linear graph of protein–ligand complexes (Li et al. 2023a). To execute graph readout, GraphSite uses Set2Set as a global pooling function to reduce the graph to a single node (Vinyals et al. 2016; Shi et al. 2022). To capture the distribution of each dimension of the node representations, graphDelta applies a fuzzy histogram technique, which predefines a set of bins with associated membership functions and applies them to the node representations (Karlov et al. 2020). This yields a graph representation vector whose length equals the product of the node representation length and the number of bins. ProteinGCN incorporates both local and global pooling, based on grouping and cascading that organize nodes by the proteins’ inherent residue structures, to effectively learn residue and decoy embeddings (Sanyal et al. 2020).
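
To illustrate the fuzzy histogram readout, the sketch below uses triangular membership functions over a fixed value range; the bin count, the range, and the membership shape are our assumptions for illustration, since graphDelta’s exact choices are not reproduced here.

```python
import numpy as np

def fuzzy_histogram_readout(h, n_bins=8, lo=-1.0, hi=1.0):
    """Fuzzy-histogram graph readout in the spirit of graphDelta.

    h: (n_nodes, d) node representations. Each feature dimension is
    softly assigned to `n_bins` triangular membership functions, and
    the memberships are summed over nodes, yielding a graph vector of
    length d * n_bins, matching the size noted in the text.
    """
    centers = np.linspace(lo, hi, n_bins)
    width = centers[1] - centers[0]
    # member[i, j, b]: membership of node i's feature j in bin b
    member = np.clip(1.0 - np.abs(h[..., None] - centers) / width, 0.0, 1.0)
    return member.sum(axis=0).reshape(-1)   # shape: (d * n_bins,)
```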

Table 5 Representative methods for processing protein structures

GraphBAR uses a parallel GNN architecture with multi-channel feature extraction followed by fusion (Son and Kim 2021), whereas GraphSite extends the standard convolution-readout architecture with JK connections (Shi et al. 2022). Notably, as illustrated in Fig. 5, many models adopt a two-channel GNN architecture that takes two molecular structure networks as inputs for protein–ligand affinity or PPI prediction (Torng and Altman 2019; Shen et al. 2021; Jiang et al. 2021; Nikolaienko et al. 2022; Li et al. 2022b; Pandey et al. 2022; Réau et al. 2023; Xia et al. 2023; Huang et al. 2023). These models extract features and learn representations of the protein and the ligand using two GNNs and then fuse them using specific rules, such as a concatenation operation. DeepRank learns the representations of two interacting proteins using hierarchical pooling operators based on the Markov Cluster Algorithm (MCL) or the Louvain community detection algorithm within a two-branch hierarchical GNN; the representations are then flattened, merged, and fed into FC networks (Réau et al. 2023). The geometry-aware interactive graph neural network (GIANT) employs a decoupled cross pooling module to learn initial representations of the protein and ligand molecules, and a fused global pooling module to fuse their representations and capture the interaction between proteins and ligands (Li et al. 2023b). Both pooling modules are implemented with attention mechanisms and Gated Recurrent Units (GRUs).
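
The two-channel pattern can be summarized in a few lines. In the sketch below, each branch is reduced to a single message-passing step with a mean readout; the class name, layer sizes, and fusion head are illustrative assumptions rather than any surveyed model’s actual design.

```python
import torch
import torch.nn as nn

class TwoChannelAffinityNet(nn.Module):
    """Fig. 5 pattern: one GNN branch per molecule, a global readout
    in each branch, concatenation, then FC layers for regression."""
    def __init__(self, d_prot, d_lig, d_hid=64):
        super().__init__()
        self.prot_enc = nn.Linear(d_prot, d_hid)  # stands in for a GNN stack
        self.lig_enc = nn.Linear(d_lig, d_hid)
        self.head = nn.Sequential(
            nn.Linear(2 * d_hid, d_hid), nn.ReLU(), nn.Linear(d_hid, 1))

    def branch(self, enc, adj, h):
        h = torch.relu(enc(adj @ h))   # one message-passing step
        return h.mean(dim=0)           # global mean readout per molecule

    def forward(self, adj_p, h_p, adj_l, h_l):
        z_p = self.branch(self.prot_enc, adj_p, h_p)
        z_l = self.branch(self.lig_enc, adj_l, h_l)
        return self.head(torch.cat([z_p, z_l], dim=-1))  # predicted affinity
```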

Fig. 5 GNN architectures for protein–ligand binding affinity prediction

Discussion: In general, simple readout functions and their combinations are preferred in current studies of molecular structures for protein–ligand prediction (Torng and Altman 2019; Cho et al. 2020; Son and Kim 2021; Shen et al. 2021; Gligorijević et al. 2021; Jiang et al. 2021; Lai and Xu 2022; Li et al. 2022b; Yang et al. 2023b; Xia et al. 2023; Huang et al. 2023). Global pooling operators can complement the local embeddings learned by GCNs by capturing global information. In SIGN, for instance, PiPool is developed in a semi-supervised manner to model global long-range interactions between atoms in proteins and ligands, thereby enhancing the effectiveness and generalizability of the model (Li et al. 2021a). Clustering pooling, on the other hand, is well suited to detecting clusters in proteins with closer 3D spatial proximity or stronger interactions, as it can incorporate distance during the clustering process (Réau et al. 2023). Jiang et al. reviewed the performance of ML-based models on PDBBind’s V2016 and V2013 core sets for protein–ligand interaction prediction (Jiang et al. 2021). Concurrently, Li et al. benchmarked SIGN against multiple ML-based methods, assessing their ability to capture spatial and long-range interactions (Li et al. 2021a). The findings revealed that non-spatial methods were the least effective, classical ML methods generalized poorly to external datasets, and GNN-based approaches integrating both kinds of features performed best. GNNs with pooling exhibit a parameter count 1–2 orders of magnitude lower than that of 3D convolutional neural networks (Sanyal et al. 2020). Pooling operators also contribute to interpretability in research on molecular structure graphs. PSG-BAR uses attention-based global pooling to identify surface residues critical for protein–ligand binding by ranking attention scores (Pandey et al. 2022). Jiang et al. investigated intermolecular interactions by computing and visualizing the similarity of protein–ligand complex embeddings obtained through pooling (Jiang et al. 2021). They further explained how atomic pairwise interactions in protein–ligand complexes, as reflected by the weights of atom–residue pairs within the pooling module, influenced the final prediction, validating the consistency between the computational model and expert knowledge. In related tasks involving protein structure graphs, the parameter sensitivity of pooling operators remains a significant factor. In Struct2GO, a pooling operator based on a self-attention mechanism is employed for protein function prediction (Jiao et al. 2023); however, both excessively high and excessively low pooling ratios fail to achieve optimal performance.

5 Conclusion

Graph neural networks, which process graphs with neural networks, excel in various graph-related tasks. The GraphRec framework, for instance, succeeds in social recommendation by analyzing user-item graph interactions, surpassing baseline methods on real-world datasets (Fan et al. 2019). In bioinformatics, GNNs have proven instrumental for predicting associations between long non-coding RNAs (lncRNAs) and diseases, as well as for inferring relationships among lncRNAs, microRNAs (miRNAs), and diseases (Sheng et al. 2023a, b). Beyond node- and edge-level analysis, GNNs extend their utility to graph-level tasks, exemplified by their ability to leverage protein structural information for accurate functional prediction, thereby opening new avenues in computational biology (Sanders et al. 2023). Graph pooling is a key module bridging node representation learning and specific graph-level tasks. This review presents a comprehensive survey of pooling operators in GNNs and their applications in omics from multiple perspectives. The global pooling and hierarchical pooling operators are classified and summarized, along with the details of prevalent hierarchical pooling methods, including clustering pooling, node selection pooling, edge pooling, and hybrid pooling. In addition, we discussed existing benchmark datasets and fair evaluation frameworks for graph pooling. Representative applications of graph pooling operators to graphs of molecular structures and medical images for drug discovery and disease diagnosis are also summarized. Through examples of brain connectivity network analysis and protein–ligand affinity prediction, we demonstrated how graph pooling can benefit omics applications in both prediction performance and interpretability.

Despite significant progress in graph-level learning, there remain unresolved challenges for graph pooling. Below, we discuss some prospective research directions for graph pooling to encourage continued investigation in this field.

5.1 Large-scale graphs and graph foundation models

Despite achieving state-of-the-art performance on many small benchmark datasets, graph pooling operators face significant challenges on large-scale datasets, where the requirements for expressive power and the computational time and space costs become more demanding. Emerging graph foundation models are a frontier integrating graph structure with large language models (Tang et al. 2023; Zhao et al. 2023; Tian et al. 2024). These models are expected to possess versatile graph reasoning capabilities, including understanding basic topological graph properties, reasoning over multi-hop neighborhoods, and capturing global properties and patterns (Zhang et al. 2023c). This aligns well with the advantages exhibited by graph pooling operators, raising the prospect of pooling becoming an essential component of future graph foundation models for handling large-scale graph-level tasks. In omics research, scGPT, a foundation model for single-cell biology, explores the applicability of foundation models to advancing cell biology and genetics research (Cui et al. 2024). Current findings indicate that scGPT effectively extracts key biological insights related to genes and cells, excelling in downstream applications like multi-omics integration and gene network inference (Cui et al. 2024). Graph pooling operators are expected to enhance these downstream applications and to be integrated into graph-based foundation models.

5.2 Expressive sparse graph pooling

Clustering pooling methods maintain the expressive power of message passing layers through dense cluster assignment matrices whose row sums equal one after suitable normalization (Ying et al. 2018; Bianchi and Lachi 2024). Conversely, sparse graph pooling operators that depend on node selection yield assignment matrices in which not all row sums are one, regardless of how the scores are computed (Lee et al. 2019; Grattarola et al. 2022; Bianchi and Lachi 2024); the contrast is illustrated in the sketch below. Despite the existence of expressive sparse operators, a common limitation is the inability to directly specify the number of supernodes.
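
A minimal numerical illustration of the two assignment-matrix regimes, using randomly generated scores purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense clustering assignment (DiffPool-style): a row-wise softmax makes
# every row sum to one, so each node contributes to some supernode.
logits = rng.standard_normal((5, 2))       # 5 nodes, 2 supernodes
S_dense = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(S_dense.sum(axis=1))                 # -> [1. 1. 1. 1. 1.]

# Sparse selection assignment (top-k style): kept nodes map to themselves,
# dropped nodes get all-zero rows, so the row sums are not all one.
keep = [0, 3]                              # indices of the kept nodes
S_sparse = np.zeros((5, len(keep)))
for col, node in enumerate(keep):
    S_sparse[node, col] = 1.0
print(S_sparse.sum(axis=1))                # -> [1. 0. 0. 1. 0.]
```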

5.3 Graph embedding for complex graphs

All graph pooling operators discussed in this paper are assumed to work on homogeneous graphs, whether weighted or attributed. However, real-world graphs are often complex, encompassing both heterogeneous graphs (with various types of nodes and edges) and hypergraphs (where multiple nodes are linked together via hyperedges). Most existing methods for heterogeneous graph embedding focus on node embeddings, with few addressing graph-level embeddings. Consequently, heterogeneous graph pooling emerges as an intriguing solution for graph classification on heterogeneous graphs, with the challenge of preserving heterogeneity and capturing the distinct features of various heterogeneous structures (Yang et al. 2022; Bing et al. 2023). Similarly, there is a need for solutions capable of learning hypergraph representations, employing permutation-invariant functions that aggregate node and hyperedge representations in a way that is meaningful for downstream tasks (Antelmi et al. 2023). Heterogeneous graphs and hypergraphs are effective tools for representing interactions and relationships in omics data: heterogeneous nodes represent multi-omics data, heterogeneous edges integrate various biological annotations, and hyperedges reveal high-order associations among biomolecules (Deng et al. 2024). Current omics research often has to simplify complex graphs into homogeneous ones due to the lack of specialized graph pooling operators for heterogeneous graphs and hypergraphs.

5.4 Unsupervised tasks

Most graph pooling operators discussed in this paper are used for supervised tasks, with few exploring unsupervised tasks like graph generation (Baek et al. 2021; Guo et al. 2023) and node clustering (Wang et al. 2022; Tsitsulin et al. 2023). Hierarchical GNN architectures, associated with hierarchical pooling, naturally fit hierarchical network structures and tasks such as hierarchical community detection (Su et al. 2022), suggesting a potential expansion of existing operators to more unsupervised tasks or the development of new operators for such tasks. In omics research, the abundance of unlabeled data and the difficulty of obtaining labels often cause annotation to lag behind data generation. Consequently, unsupervised tasks are likely to become a major research focus, and unsupervised data may be used to enhance supervised tasks. One approach is to use a general GNN architecture with different modules for different tasks, such as classification and clustering (Hou et al. 2024; Hu et al. 2024).

5.5 Interpretability

In domain-specific scenarios, the interpretability of graph pooling operators, and of graph neural networks by extension, relies on two key aspects: (1) identifying correlations between model outputs and original input features, and (2) revealing domain-specific insights for the relevant features identified by the model. Explanation components within operators, enabling self-explaining models and domain-knowledge-driven interpretability, should thus become fruitful research directions (Yang et al. 2023a; Karim et al. 2023; Wysocka et al. 2023). In omics research, one available domain-specific interpretability framework is bio-centric model interpretability. It grounds interpretability in a biomedical context through three aspects: architecture-centric interpretability, output-centric interpretability, and post-hoc evaluation of biological plausibility (Wysocka et al. 2023). These are assessed via four components: integration of different data modalities, schema-level model representation, integration of domain knowledge, and post-hoc explainability methods (Wysocka et al. 2023). Schema-level representation requires GNNs to understand graph representations and transformations and to communicate such transformations during post-hoc inference, where clustering pooling operators are essential. Post-hoc explainability methods require model architectures that mirror biological relationships, track information flow, and identify the importance of model components, highlighting the importance of selection pooling operators.

We hope that this paper provides a useful framework for researchers interested in graph pooling. Although GNNs with pooling now play a key role in many biological tasks and produce outstanding results, the pooling operators employed remain confined to a few methods. The exploration of the potential of varied graph pooling operators in omics studies is ongoing.