K-plex cover pooling for graph neural networks

Graph pooling methods provide mechanisms for structure reduction that are intended to ease the diffusion of context between nodes farther apart in the graph, and that typically leverage community discovery mechanisms or node and edge pruning heuristics. In this paper, we introduce a novel pooling technique that borrows from classical results in graph theory, is non-parametric, and generalizes well to graphs of different nature and connectivity patterns. Our pooling method, named KPlexPool, builds on the concepts of graph covers and k-plexes, i.e. pseudo-cliques where each node can miss up to k links. The experimental evaluation on molecular and social graph classification benchmarks shows that KPlexPool achieves state-of-the-art performance against both parametric and non-parametric pooling methods in the literature, despite generating pooled graphs based solely on topological information.


Introduction
Graph Neural Networks (GNNs) allow for the adaptive processing of topology-varying structures representing complex data which comprises atomic information entities (the nodes) and their relationships (the edges).
The processing of graphs by neural networks typically leverages message passing between neighboring nodes to collect and exchange information on the context of nodes. This process is made effective and efficient, also on cyclic structures, by feedforward neural layers with node-level weight sharing, an approach popularized under the term graph convolutions (Kipf and Welling 2017), but previously known as contextual structure processing (Micheli 2009; Bacciu et al. 2018).
Graph pooling methods provide mechanisms for structure reduction that are intended to ease the diffusion of such context between nodes farther apart in the graph. These methods developed from the original concept in image processing, realizing structure reduction layers that are interleaved with graph convolutions to provide a multi-resolution view of the input graph. The intent is to extract coarser and more abstract representations of the graph as we go deeper in the network, boosting information spreading among nodes. In this respect, graph pooling can help contain model complexity, by reducing the number of convolutional layers needed to achieve full coverage of the graph, and counteract oversmoothing issues (i.e. nodes converging to very similar embeddings) by introducing structural diversity among convolutional layers.
The definition of a robust, general and efficient subsampling mechanism that can scale to topology-varying graphs is made difficult by the irregular nature of the data and by the lack of a reference ordering of nodes across samples. Graph pooling methods in the literature address the problem from a community discovery perspective, more or less explicitly, by considering node connectivity patterns (Simonovsky and Komodakis 2017; Ma et al. 2019; Wang et al. 2020; Defferrard et al. 2016) or by aggregating the nodes based on their similarity in the neural embedding space (Ying et al. 2018; Lee et al. 2019; Bianchi et al. 2020).
Our work hinges on an explicit (and novel) link between pooling operators and two consolidated graph-theoretical concepts in community discovery, namely, k-plexes and graph covers. The former provides a flexible formalization of a community of nodes as a densely interconnected and cohesive subgraph, which relaxes the definition of a clique by allowing each node to miss up to k links. The latter relates to a soft partition of nodes whose union covers all nodes in the original graph. We argue that both concepts are necessary to realize an effective and general graph pooling mechanism, as they make it possible to summarize the overall community structure with a small but meaningful set of highly connected components, represented by the k-plexes.
We introduce KPlexPool, a novel pooling method using only topological graph features, which is neither parameterized nor dependent in its outcomes on the specific predictive task. Hence, its structure reduction can be precomputed once and reused across multiple learning model configurations. We show how KPlexPool, despite being non-adaptive, provides a flexible and robust definition of node communities which generalizes well to graphs of different nature and topology, from molecular structures to social networks.
We remark that KPlexPool is not just a straightforward application of k-plexes to deep learning for graphs. Rather, our pooling mechanism is the result of the careful design and integration of different community discovery and graph simplification mechanisms, specifically crafted to obtain a hierarchical structure coarsening algorithm well suited to promote effective context diffusion in neural message passing. This goal is quite challenging: for instance, a previous attempt by Luzhnica et al. (2019) clearly shows that the straightforward application of clique discovery does not yield an effective graph pooling method. We hope that our work can further stimulate the community's interest in graph theory and algorithms, which have been excellent sources of inspiration for the machine learning community, the Weisfeiler-Lehman graph kernel (Shervashidze et al. 2011; Kriege et al. 2020; Vishwanathan et al. 2010) being one notable example. The main original contributions of this paper are summarized below.
- We propose KPlexPool, the first pooling mechanism based on the concepts of k-plex communities and graph covers (Sect. 2.3).
- We define a scalable algorithm to compute a hierarchy of k-plex cover decompositions that optimizes context propagation and promotes diversity in convolutional layers (Sect. 2.4).
- We propose a post-processing cover heuristic for sparsifying pooled graphs in scale-free structures (Sect. 2.6).
- We provide a thorough and reproducible empirical assessment on 9 graph classification benchmarks, where KPlexPool is shown to be state-of-the-art (Sect. 4).
We remark that while the notion of clique covering has been studied since the 70s, we are not aware of any formal definition or algorithm for k-plex covering in the literature. Hence, our contribution is novel in: i) the use of k-plex communities to define pooling mechanisms in neural processing systems; ii) the definition of the notion of k-plex cover; iii) the definition of an efficient algorithm to implement the concepts above.

K-plex cover graph pooling
In this section we present KPlexPool, a novel method for graph coarsening that uses k-plexes as a pooling block. In Sects. 2.1, 2.2 we introduce some useful definitions from graph theory and deep learning that will be used throughout this paper. In Sect. 2.3, we begin by defining how k-plexes can be used to coarsen connected communities in the graph and how edges can be determined in the pooled structure. In Sect. 2.4 we describe the algorithms to efficiently compute the k-plex covers. Finally, in Sect. 2.6 we introduce a useful heuristic for sparsifying scale-free structures.

Preliminaries on graph theory
Given an undirected graph G, let V = V(G) be its node set and E = E(G) be its edge set, where v(G) = |V| = n and e(G) = |E| = m are, respectively, the number of nodes and edges in G. Given an edge e = {u, v}, nodes u and v are said to be adjacent. The neighborhood N(v) of v is the set of nodes adjacent to it, and the degree d(v) of v is defined as the number of its neighbors, i.e. |N(v)|. An attributed graph is a tuple (G, φ, ψ), where φ : V → R^{h_V} and ψ : V × V → R^{h_E} are functions that assign a vector of features, respectively of size h_V and h_E, to each node and each edge. If e ∉ E, then ψ(e) = 0. An attributed graph can also be represented in matrix notation (A, X) by taking an arbitrary (usually predefined) ordering of its nodes V = {v_1, ..., v_n}, having X ∈ R^{n×h_V} as its node-feature matrix, whose rows are defined as X_i = φ(v_i), and A ∈ R^{n×n×h_E} as its adjacency matrix, with A_{ij} = ψ({v_i, v_j}). A k-plex is a subset of nodes S ⊆ V such that each node in S has at least |S| − k adjacent nodes in S: for all v ∈ S, we have |N(v) ∩ S| ≥ |S| − k. This definition is quite flexible: for k = 1 we get the classical clique, and for larger values of k we obtain a relaxed and broader family of (possibly larger) subgraphs of G. Some examples are shown in Fig. 1. A k-plex cover of G is a family S of subsets of V such that each set S ∈ S is a k-plex and their union covers the node set, i.e. ⋃_{S∈S} S = V.
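To make the definitions concrete, both properties can be checked directly from their set-theoretic statements. The following sketch (plain Python, with the graph given as an adjacency dictionary of sets; function names are illustrative, not from the paper's implementation) tests whether a node subset is a k-plex and whether a family of sets is a k-plex cover:

```python
def is_k_plex(adj, S, k):
    """Check that every node in S has at least |S| - k neighbors inside S."""
    return all(len(adj[v] & S) >= len(S) - k for v in S)

def is_k_plex_cover(adj, cover, k):
    """A k-plex cover: every set is a k-plex and their union is the node set."""
    return (all(is_k_plex(adj, S, k) for S in cover)
            and set().union(*cover) == set(adj))
```

For instance, on a path a–b–c the whole node set is a 2-plex (each endpoint misses one link) but not a clique (1-plex).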

Preliminaries on graph neural networks
Graph Neural Networks are a specialized kind of neural network that adaptively learns fixed-size representations of a set of graphs from a given distribution. GNNs are typically defined in terms of graph convolutions, both in the spectral (Bruna et al. 2014; Defferrard et al. 2016; Kipf and Welling 2017) and in the spatial domain (Hamilton et al. 2017; Monti et al. 2017; Xu et al. 2018; Veličković et al. 2018; Xu et al. 2019; Morris et al. 2019). Spectral GNNs perform graph convolution using the graph Fourier transform, or by approximating it with a (truncated) Chebyshev expansion (Hammond et al. 2011). Spatial GNNs, instead, perform graph convolution by applying a learned filter to the features of each node and its local neighborhood. Most GNNs can be described as instances of the so-called Message-Passing Neural Network model (Gilmer et al. 2017), which takes inspiration from the classical message-passing paradigm and from the very first learning approaches to graph-structured data (Micheli 2009; Scarselli et al. 2009). Battaglia et al. (2018) further abstract this model by proposing a more general form, GraphNet, where edge- and graph-level attributes can also be parametrized.
GNNs often leverage pooling techniques to obtain a coarsened representation of the input graphs. As in Convolutional Neural Networks (Fukushima 1980; LeCun et al. 1989), pooling serves both as dimensionality reduction and as a means to reduce the distance between objects in the input, thus increasing the receptive field of parametric functions applied on its output. Graph pooling is typically performed using classical clustering algorithms such as Graclus (Dhillon et al. 2007; Bruna et al. 2014; Defferrard et al. 2016; Simonovsky and Komodakis 2017; Ma et al. 2019; Wang et al. 2020), or adaptively, as in Ying et al. (2018); Cangea et al. (2018); Gao and Ji (2019); Lee et al. (2019). A more in-depth comparison is provided in Sect. 3. Pooling is also used to obtain a final, global representation of the whole input graph. This technique, usually denoted as global pooling, is typically performed using standard aggregation functions such as sum, max or mean (Xu et al. 2019), or by exploiting techniques from the field of (multi-)set representation learning, such as GlobalAttentionPool (Li et al. 2016), Set2Set (Vinyals et al. 2016), and SortPool (Zhang et al. 2018).

Graph pooling with k-plexes
KPlexPool computes a k-plex cover S = {S_1, ..., S_c} of the input graph (G, φ, ψ), for a given k, and returns a coarsened graph (G', φ', ψ'), such that node v'_i represents the coarsened version of S_i, and edge {v'_i, v'_j} exists iff there is at least one edge in G linking a node of S_i with a node of S_j. The feature functions φ' : V' → R^{h_V} and ψ' : V' × V' → R^{h_E} aggregate, respectively, features belonging to the same k-plex S_i and features of edges linking two different S_i and S_j. In other words, they are defined so as to provide a suitable relabeling for nodes and edges in the coarsened graph:

φ'(v'_i) = β({φ(u) : u ∈ S_i}),   (2.3)
ψ'({v'_i, v'_j}) = γ({ψ({u, w}) : u ∈ S_i, w ∈ S_j}),   (2.4)

where β and γ are arbitrary aggregation functions defined over multisets of feature vectors. Typical aggregators for node attributes are element-wise max or sum (Xu et al. 2019). For edge weights, our choice is the sum reduction: if the input graph has ψ(e) = 1 for every e, then the weight of an edge in the coarsened graph is the number of edges crossing the two linked k-plexes. Figure 2 shows an example of a graph and the hierarchical reduction produced by the application of a series of three sum-pooling layers. Differently from other partition-based graph coarsening methods (Dhillon et al. 2007; Ng et al. 2002), in our approach a node may belong to multiple k-plexes. This is also a key difference between CliquePool (Luzhnica et al. 2019) and KPlexPool with k = 1 (i.e., performing a clique cover): the former forces a partition of the nodes, potentially destroying structural relationships in the communities.
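As an illustration of the construction above, the following sketch (plain Python; names are illustrative, not the authors' implementation) builds the coarsened node features and edge weights from a given cover, using sum for both β and γ, so that the weight of a coarsened edge is the number of original edges crossing the two k-plexes:

```python
def coarsen(edges, node_feat, cover):
    """Coarsen a unit-weight graph given a cover (list of node sets).

    Returns summed node features per set and, for each pair of sets,
    the number of original edges crossing them (sum aggregation).
    """
    new_feat = [sum(node_feat[u] for u in S) for S in cover]
    new_edges = {}
    for u, w in edges:
        for i, Si in enumerate(cover):
            for j, Sj in enumerate(cover):
                if i < j and ((u in Si and w in Sj) or (u in Sj and w in Si)):
                    new_edges[(i, j)] = new_edges.get((i, j), 0) + 1
    return new_feat, new_edges
```

On a path 0–1–2 with cover {0, 1}, {1, 2}, the two coarsened nodes are linked by an edge of weight 2, since both original edges cross the two (overlapping) sets.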
From Eq. 2.2 and the fact that every edge forms a clique, it follows that whenever two k-plexes share a node, their respective aggregated nodes in the coarsened graph are adjacent. Hence, not only can G' be denser than G, but KPlexPool may even produce G' such that e(G') > e(G) in some pathological cases (e.g., star graphs). For this reason, our KPlexPool algorithm incorporates a graph sparsification method, whose details are discussed in Sect. 2.6.

Fig. 2  Two examples of sum-pooling on the same graph, where φ(v) = 1 and ψ(e) = 1. In the first example (top), the hierarchy is formed by clique pooling (i.e., k-plex pooling with k = 1). In the second example (bottom), the first layer of the hierarchy is obtained using 2-plex pooling. Numbers in roman and italic fonts represent edge and node attributes, respectively.
To provide a compact (vectorial) definition of our operator, we can represent the k-plex cover as a matrix S ∈ {0, 1}^{n×c} of hard assignments, i.e., S_{ij} = 1 iff v_i ∈ S_j. If we use sum for both aggregation functions, it is easy to see that Eqs. 2.3, 2.4 can be rewritten in matrix form as

X' = S^T X,   A' = S^T A S,

where X' ∈ R^{c×h_V} and A' ∈ R^{c×c×h_E} are, respectively, the node-feature and adjacency matrices of G'.
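With sum aggregation and scalar edge weights, the matrix form can be sketched in a few lines of NumPy (an illustration on a toy graph, not the authors' implementation):

```python
import numpy as np

# Path graph v1 - v2 - v3 with unit node features and unit edge weights
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
X = np.ones((3, 1))

# Hard-assignment cover matrix: S1 = {v1, v2}, S2 = {v2, v3} (v2 is shared)
S = np.array([[1, 0],
              [1, 1],
              [0, 1]])

X2 = S.T @ X      # summed node features per k-plex
A2 = S.T @ A @ S  # pooled edge weights between k-plexes
```

Note that the diagonal of S^T A S accumulates the weights of edges internal to each k-plex; in practice these entries are typically discarded (or kept as self-loops) when building the pooled adjacency matrix.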

Algorithm 1 KPlexCover
input A graph G, an integer k ≥ 1, and two priority functions f and g.
S ← ∅; U ← V(G)
while U ≠ ∅: extract from U the node v with highest priority f(v); S ← S ∪ {FindKPlex(G, k, g, v)}; U ← U \ S; suitably update priority f.
return S

Algorithm 2 FindKPlex
input A graph G, an integer k ≥ 1, a priority function g, and a pivot node v.
S ← {v}; C ← N(v)
while C ≠ ∅: extract from C the node u with highest priority g(u); S ← S ∪ {u}; remove from C the nodes that can no longer be added to S; add to C the nodes in N(u) \ S that can be added to S; suitably update priority g.
return S

K-plex cover algorithm
We propose an algorithm, whose pseudocode is shown in Algorithms 1, 2, that finds a cover containing large k-plexes with small intersections. The rationale for this choice is driven by the sought-after effect on graph pooling mechanisms in graph neural networks. On the one hand, we seek to condense into a single community-node those neighboring nodes which are likely to share the same context and, hence, very similar embeddings. On the other hand, we would like the pooled graph to preserve diversity between nodes belonging to different communities, i.e. to avoid trivial aggregations which would induce heavy connectivity between the communities. Our algorithm is inspired by the clique covering framework of Conte et al. (2016, 2020), and leverages heuristics that specify the order in which nodes are considered for k-plex inclusion. Algorithm 1 receives this order in input by means of two priority functions f, g on V, which are defined so as to produce large k-plexes with small pair-wise intersections. In practice, we fixed f, g to prioritize nodes with lower degree (for f) and more neighbors in the k-plex (for g). A deeper discussion of priority functions is provided in Table 1.
Algorithm 1 uses Algorithm 2 as a subroutine, where the latter, starting from an uncovered node v (i.e., a node that is not included in the current cover), retrieves a k-plex S ⊆ V containing v. We discuss both Algorithms 1 and 2 in more detail in the following. Algorithm 1 begins by iterating over the available nodes in the candidate set U, which is initialized with the whole node set of the input graph. At each iteration, it selects the next candidate v by extracting the node in U with highest priority f(v). The node v is then used as a starting node for retrieving the next k-plex S ⊆ V, using Algorithm 2. The elements of S are then removed from the set U of candidates, and S is included in the output cover S. Note that the nodes in the k-plexes are removed from U but not from the graph, hence successive executions of Algorithm 2 may retrieve k-plexes containing previously removed nodes. The algorithm stops when eventually all the nodes have been removed from U, and then returns the cover S.

Table 1  Proposed priority functions for f and g, where π_n represents a random permutation of the first n natural numbers. Note that every f is also a valid g, but not the other way round. The third column shows the computational cost of updating a priority at every iteration of the while loop in Algorithm 2.

Algorithm 2, instead, constructs a k-plex S starting from a given node v, which is the only element of S at startup. It initializes a candidate set C of nodes that could be part of the k-plex, this time relying on N(v). Again, Algorithm 2 iterates over the nodes in C following the ordering defined by the priority g, and adds them to S. The main loop maintains the following invariants:

∀u ∈ S : |N(u) ∩ S| ≥ |S| − k,   (2.6)
∀u ∈ S : |N(u) ∩ S| = |S| − k ⟹ C ⊆ N(u),   (2.7)

where Eq. 2.6 states that every node u in S needs to have at least |S| − k adjacent nodes in S, and Eq. 2.7 states that, when u has exactly |S| − k adjacent nodes in S, we cannot have nodes in C that are not adjacent to u (as they would break Eq. 2.6 if selected). As a result, any node from C can be added to S. Both invariants are satisfied at the first iteration, as all the candidates in C = N(v) are adjacent to v, which is the only element in S. At each iteration, Algorithm 2 preserves Eq. 2.6 by selecting a node u from C according to priority g. It then needs to preserve Eq. 2.7 as S changes with the addition of u. The first two for loops remove from C the nodes that can no longer be added to S. The third loop adds to C the nodes adjacent to u which can be added to S. After that, g is updated.
The above invariants guarantee that any S retrieved by Algorithm 2 is a k-plex. As Algorithm 1 terminates only when all the nodes have been covered by at least one k-plex in the current solution S, we obtain that S is a k-plex cover, hence proving the correctness of Algorithm 1.
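To fix ideas, the overall procedure can be sketched in plain Python. This is a simplified illustration, not the paper's optimized implementation: the candidate filter re-checks the k-plex property directly instead of maintaining the invariants incrementally, and the priorities are reduced to plain min-degree (for f) and max-in-k-plex (for g):

```python
def find_k_plex(adj, k, v):
    """Greedily grow a k-plex around pivot v (simplified Algorithm 2)."""
    S, C = {v}, set(adj[v])
    while C:
        # max-in-k-plex priority: most neighbors inside the current k-plex
        u = max(C, key=lambda x: len(adj[x] & S))
        S.add(u)
        # keep only candidates whose addition would preserve the k-plex property
        C = {w for w in (C | adj[u]) - S
             if all(len(adj[x] & (S | {w})) >= len(S) + 1 - k for x in S | {w})}
    return S

def k_plex_cover(adj, k):
    """Cover every node with k-plexes (simplified Algorithm 1)."""
    cover, U = [], set(adj)
    while U:
        v = min(U, key=lambda x: len(adj[x]))  # min-degree cover priority
        S = find_k_plex(adj, k, v)
        cover.append(S)
        U -= S
    return cover
```

With k = 1 this degenerates to a greedy clique cover; covered nodes are removed from U but remain in the graph, so later k-plexes may reuse them, as in the text above.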

Priority functions
To effectively summarize the graph for pooling purposes, we need to cover all the nodes with as few k-plexes as possible: we thus want these k-plexes to be large and with small overlaps. Following the heuristics experimented with in Conte et al. (2016, 2020), we adapt the best performing priorities to the case of k-plexes. Specifically, the priorities f and g define how the nodes will be extracted during the execution of Algorithms 1, 2, guiding the covering process. Table 1 provides a selection of priority functions along with their computational cost. They span from random baseline policies to orders induced by the topology of the pooled graph, intended to yield pooled graphs with interesting structural properties. Random and max-degree priorities should be quite self-explanatory; max-uncovered returns the number of neighbors that are not yet covered by a k-plex in S. Every f priority is also a valid strategy for g. Concerning the latter, we can also consider (i) max-in-k-plex, which assigns to every candidate node the number of its neighbors within the current k-plex S; and (ii) max-candidates, which assigns to every candidate node the number of its neighbors within the current candidate set C. For the experimental assessment in Sect. 4, we implement the cover priority f as the concatenation of min-degree and max-uncovered. Here the term concatenation denotes the fact that nodes are ordered following a lexicographic ordering defined first on min-degree, with ties broken on max-uncovered (note that min-degree is discrete-valued and many nodes are likely to have the same priority). The k-plex priority g is obtained as the concatenation of max-in-k-plex, max-candidates and min-uncovered, following the lexicographic ordering approach described for f. Both of the concatenated priorities, respectively for f and g, will be referred to as default in the following. Note that min-priorities can be trivially obtained from the respective max-definitions.
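The concatenation of priorities can be conveniently expressed as lexicographic comparison of tuples. A minimal sketch (names are illustrative; since extraction here uses min(), max-criteria are encoded by negation):

```python
# Default cover priority f: min-degree first, ties broken by max-uncovered.
def default_f(v, adj, covered):
    degree = len(adj[v])
    uncovered_neighbors = len(adj[v] - covered)
    return (degree, -uncovered_neighbors)  # smaller tuple = higher priority

# Example: path a - b - c - d, with node 'a' already covered.
adj = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'c'}}
covered = {'a'}
pivot = min(adj.keys() - covered, key=lambda v: default_f(v, adj, covered))
```

Here 'd' wins: it has the minimum degree, so the tie-breaking term is never needed.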
The choice of these combinations of policies is driven by the observation that they are able to generate large k-plexes while maintaining small intersections between them, which, again, is a key property we would like to induce in our pooling mechanism. Figures 3, 4 show two useful statistics obtained by executing Algorithm 1 with various combinations of the proposed priority functions. Specifically:

- the average number of clusters in a cover (Fig. 3), which we aim to minimize, since a high number of clusters can be a signal of large intersections and/or small k-plexes;
- the average number of occurrences of nodes in a cover (Fig. 4), which again we want to minimize, since a node has more than one occurrence if it belongs to multiple k-plexes and, thus, lies in an intersection between them.

Cover post-processing
On certain kinds of graphs, like star graphs or scale-free networks, a few hub nodes have a degree that greatly exceeds the average of the graph: Algorithm 1 may include them in many distinct (and all pairwise adjacent) k-plexes, generating dense artifacts that do not represent the graph well. To overcome this problem, we introduce a cover post-processing mechanism, referred to as hub promotion, that sparsifies the coarsened graph by changing the cover assignments: hub nodes are removed from the sets of the cover, and each is assigned to a dedicated singleton set. Specifically, for a given cover S of a graph G and a threshold value p ∈ [0, 100], every node whose degree exceeds the p-th percentile of the degree distribution of G is treated as a hub, removed from every set in S, and promoted to a singleton set of its own. The effect of this method is assessed on social network datasets in Sect. 4. The results of a detailed ablation study, including both molecular and social graphs, are reported in Sect. 4.1.
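A minimal sketch of this post-processing step (plain Python with NumPy for the percentile; the exact percentile convention and tie handling of the original implementation are our assumptions):

```python
import numpy as np

def hub_promotion(adj, cover, p):
    """Promote nodes above the p-th degree percentile to singleton sets."""
    degrees = {v: len(adj[v]) for v in adj}
    threshold = np.percentile(sorted(degrees.values()), p)
    hubs = {v for v in adj if degrees[v] > threshold}
    # Remove hubs from every set of the cover (dropping emptied sets)...
    new_cover = [S - hubs for S in cover]
    new_cover = [S for S in new_cover if S]
    # ...and assign each hub to a dedicated singleton set.
    return new_cover + [{h} for h in hubs]
```

On a star graph, for instance, the center is promoted to its own set, so the pooled graph no longer collapses the leaves into mutually adjacent hub communities.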

Computational cost
It can be easily shown that Algorithm 2 takes O(m) time, since we can efficiently store and update g with standard data structures: indeed, each node v is added to S or C and removed from C at most once, and each update costs O(d(v)) time. KPlexPool performs ℓ pooling steps. In each step, it finds a cover via Algorithm 1, taking O(nm) time, and then performs feature aggregation in the clusters by multiplying two matrices of sizes at most n × n (the cover has at most n k-plexes). The best known cost for such a multiplication is O(n^ω), where ω < 2.373 (Le Gall 2014), giving a total cost of O((nm + n^ω)·ℓ) time.

Related works
Models in the literature can be partitioned into two general classes, based on whether they tackle the structure reduction problem using a topological or an adaptive approach. The former exploits solely the structural and community information conveyed by the sample topology. The latter generates the graph reduction using the information in the node embeddings, leveraging adaptive parameters that are trained together with the graph convolution parameters. Several topological pooling algorithms are based on a clustering of the adjacency matrix. METIS (Karypis and Kumar 1998) and Graclus (Dhillon et al. 2007) are multi-level clustering algorithms that partition the graph by a sequence of coarsening and refinement phases. CliquePool (Luzhnica et al. 2019), instead, aggregates the attribute vectors of the nodes within the same clique. Adaptive pooling is, again, largely based on clustering but, differently from topological approaches, it is performed on the neural embeddings obtained from the graph convolutional layers. These methods rely on a parameterized clustering of the neural activations, which entails that they need to preserve differentiability. DiffPool (Ying et al. 2018) pioneered the approach by introducing a hierarchical clustering that soft-assigns each node to a fixed number of clusters c, generating S ∈ R^{n×c} with another GNN. This process does not yield an actual reduction of the adjacency matrix, as this would infringe differentiability, resulting in highly parameterized models of considerable complexity. MinCutPool (Bianchi et al. 2020) further improves this technique by optimizing a relaxed version of the normalized-cut objective. StructPool (Yuan and Ji 2019), instead, computes the soft-clustering associations through a Conditional Random Field (CRF). gPool (Gao and Ji 2019; Cangea et al. 2018), also known as TopKPool (Fey and Lenssen 2019), addresses this shortcoming by learning a single vector p ∈ R^{h_in}, used to compute projection scores that serve to retain the top ⌈kn⌉ ranked nodes.
SAGPool (Lee et al. 2019; Knyazev et al. 2019) later extended TopKPool to compute the score vector by means of a GNN. GSAPool (Zhang et al. 2020) further extends this model with a convex combination of two projection vectors: one learned by a GNN, and another by a classical neural network (with no topology information). The authors of ASAPool (Ranjan et al. 2020) showed the limitations of using a standard GCN (Kipf and Welling 2017) to compute the projection scores, and defined another graph convolution (LEConv) for that specific purpose. EdgePool (Diehl 2019; Diehl et al. 2019) reduces the graph by edge contraction, based on a score computed by a neural model that takes the incident nodes of each edge as input.
KPlexPool falls into the family of topological pooling methods, but proposes an approach radically different from adjacency-clustering models, based on well-grounded concepts from graph algorithmics. CliquePool is the most closely related model, but it considers a much more restricted and less flexible form of community than ours. Also, CliquePool is limited to simple graph partitions, whereas our approach can leverage the flexibility of the assignments of graph covers. The effect of the greater generality and flexibility of KPlexPool is evident from its excellent empirical performance on a variety of graph topologies (Sect. 4). When compared to adaptive pooling methods, KPlexPool certainly has the disadvantage of relying on graph reductions that are not driven by the predictive task. Nonetheless, the experimental assessment shows that KPlexPool can achieve state-of-the-art performance on both molecular and social graphs, also outperforming adaptive approaches where pooling is learned to optimize the predictive task.
Experimental setup

For fairness, each pooling method has been tested by plugging it into the same standardized architecture (Baseline), comprising ℓ convolutional blocks followed by two dense layers, the latter interleaved by dropout with probability 0.3. Every convolutional block is formed by two GNN layers (either GCN (Kipf and Welling 2017) or GraphSAGE (Hamilton et al. 2017)) with Jumping Knowledge (Xu et al. 2018), followed by a dense layer. After every convolutional block we apply a global sum-pooling, and the concatenation of the resulting vectors is batch-normalized and fed to the final dense block. Every layer, GNN or dense, has h units and a ReLU activation function (Goodfellow et al. 2016). For every model other than the Baseline, a pooling layer is placed after each of the first ℓ − 1 convolutional blocks, and its output is fed to the next block. All models have been implemented using PyTorch-Geometric (Fey and Lenssen 2019), which also provided implementations for Graclus, DiffPool, MinCutPool, TopKPool, and SAGPool. Leiden has instead been computed using the cuGraph library (RAPIDS Development 2018). We re-implemented CliquePool using the NetworkX library (Hagberg et al. 2008), which provides an implementation of the Bron and Kerbosch (1973) algorithm and its optimizations (Tomita et al. 2006; Cazals and Karande 2008). Our experimental approach followed the standardized reproducible setting in Errica et al. (2020). Specifically, for model assessment we used a stratified 10-fold cross-validation and, for model selection, an inner stratified shuffle split, generating a validation set of the same size as the outer fold and leaving the remaining examples as training set. We performed a grid search for each fold, using the parameter space listed in Table 2.

Table 2  Hyper-parameter search space, where GNN is the convolution type, ℓ the number of layers, η the learning rate, h the hidden size, r the reduction factor (with values 1/4, 1/2, 3/4 for Top-, SAG-, Diff-, and MinCutPool), and k the k-plex value.
For each parameter combination, we trained each model with early stopping after 20 epochs, monitored on the validation set. We then selected the configuration that obtained the highest accuracy on the validation set and evaluated it on the outer fold. We tested different configurations of KPlexPool depending on the domain. Since graphs become denser after each pooling layer, on biological datasets we tested the effect of a progressive reduction factor r_k ∈ (0, 1], so that pooling at layer ℓ uses k^(ℓ) = r_k · k^(ℓ−1). Here we also used the concatenation of sum and max as aggregation function on the non-parametric pooling methods, i.e. Leiden, Graclus, CliquePool and KPlexPool. This could not be applied to selection-based methods (TopKPool, SAGPool), nor to the soft-clustering in DiffPool and MinCutPool.
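The progressive reduction schedule can be sketched as follows (a hypothetical helper: the paper does not specify how fractional values of k are handled, so the ceiling and the floor of 1 are our assumptions):

```python
import math

def k_schedule(k0, r_k, num_layers):
    """Progressive reduction of k across pooling layers: k(l) = r_k * k(l-1)."""
    ks = [k0]
    for _ in range(num_layers - 1):
        ks.append(max(1, math.ceil(r_k * ks[-1])))
    return ks
```

For example, starting from k = 4 with r_k = 1/2, three pooling layers would use k = 4, 2, 1, so deeper (denser) coarsened graphs are summarized by increasingly clique-like communities.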
For the sake of conciseness, in the following we summarize the empirical results that assess the effect of the post-processing technique described in Sect. 2.6. Section 4.1 further provides an ablation study showing how k, the progressive reduction factor r_k, and the hub promotion threshold contribute to the result. All models were trained on a NVIDIA V100 GPU with 16 GB of dedicated memory, while the coverings were pre-computed on CPU, an Intel Xeon Gold 6140M with 1.23 TB of RAM. KPlexPool has been implemented both in sparse coordinate form, which is slower but space-efficient, and in matrix form, which is space-inefficient but apt to GPU parallelization. In the experiments, we used the latter except for DD, REDDIT-B and REDDIT-5K, as they contain larger graphs. For the same reason, DiffPool could not be trained on these datasets, since it required too much memory unless using only six samples per batch, thus requiring longer training times (more than two weeks per experiment). For the experiments in which DiffPool hits the out-of-resource limit, we report results from Errica et al. (2020), which have been obtained under similar, although not fully equivalent, conditions. The coarsened graphs' topologies for the non-parametric methods were pre-computed once at the beginning of every experiment, while their node features were aggregated at every training step.

Table 3  Bold highlights the best performing model. OOR stands for out of resources, and r_k is the incremental reduction factor for k.
Tables 3, 4 show the mean accuracy and standard deviation on test data (outer fold). On chemical datasets, KPlexPool yields competitive results, with higher performance than the other related methods on all benchmarks but ENZYMES. The application of the incremental reduction r_k to the k value provides noticeable benefits only on ENZYMES, while on the other tasks its effects are marginal at most.
On social benchmarks, KPlexPool performs better than parametric pooling models on all datasets, when considering the same experimental conditions. DiffPool is out-of-resources on REDDIT data, but KPlexPool is still competitive also with respect to the DiffPool results reported in Errica et al. (2020). On REDDIT-5K, only the topological pooling of Graclus yields higher accuracy. Applying hub-promotion (Sect. 2.6) increases KPlexPool performance on IMDB-M and REDDIT-B, as shown in Table 4. Its community-seeking bias seems very well suited to the processing of social graphs, where adaptive pooling methods do not seem capable of leveraging their parameters to produce more informative graph reductions.

Table 4  Bold highlights the best performing model. OOR stands for out of resources, and p is the percentile threshold value used for hub-promotion.

Perhaps surprisingly, KPlexPool achieves excellent performances also on molecular data, where we would have expected adaptive models to have an edge, confirming our initial intuition about the flexibility and generality of k-plex cover communities. These results can be well appreciated in Table 5, where we report the average rank of each model separately on chemical datasets, on social datasets, and on the union of all datasets, with respect to the results shown in Tables 3, 4. For KPlexPool, we report the average rank considering a plain configuration (i.e. no incremental reduction nor hub-promotion) as well as the best and worst performing configurations. These results highlight how the performance of KPlexPool is remarkably stable with respect to the choice of its configuration options (cover post-processing methods). A "-" in the tables implies the experiment is equivalent to the closest one on its left, as the smaller r_k value reduced k in the same manner.

Ablation studies and practical considerations
For completeness, we performed ablation studies aimed at analyzing how every component of KPlexPool contributes to the overall performance. Table 6 compares the results obtained on ENZYMES, NCI-1, and PROTEINS using KPlexPool with different combinations of k and r_k. We used the experimental approach described in Sect. 4, but restricted the hyper-parameter space (Table 2) by fixing = 3, h = 64, and GNN = GCN. As anticipated in Sect. 4, KPlexPool appears to yield better accuracy by aggregating 2- to 4-plexes instead of simple cliques, and the best values (highlighted in bold) are obtained for k = 2 on two out of the three datasets. Applying a reduction factor seems to be more effective with larger k values. This could be explained by the fact that a large k generates a cover containing few, large sets; hence, pooling layers beyond the first operate on a smaller coarsened graph with denser inner communities, which is better summarized by cliques or k-plexes with small k values. Table 7 instead compares the results obtained on COLLAB, IMDB-B, and IMDB-M using the same approach as above, but varying k and p (Sect. 2.6), and fixing = 2.
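To make the incremental reduction concrete, the sketch below shows one plausible per-layer schedule for k, assuming (hypothetically; the precise rule is the one defined for KPlexPool, not this code) that k is multiplied by r_k at each successive pooling layer, rounded up, and never allowed to drop below 1 (a clique cover). `k_schedule` is an illustrative helper, not part of the KPlexPool API.

```python
import math

def k_schedule(k, rk, num_layers):
    """Sketch of an incremental-reduction schedule: at pooling layer i,
    use max(1, ceil(k * rk**i)), so deeper layers aggregate tighter
    pseudo-cliques on the already-coarsened graph."""
    return [max(1, math.ceil(k * rk ** i)) for i in range(num_layers)]

print(k_schedule(4, 0.5, 3))  # -> [4, 2, 1]
```

Under this reading, a large initial k is progressively tightened, consistent with the observation above that coarsened graphs with denser communities are better summarized by small k values.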
Considering the generation process of the networks (see Yanardag and Vishwanathan 2015), we note that COLLAB is obtained by connecting authors that co-authored a paper, while IMDB-B and IMDB-M connect actors/actresses that co-starred in a movie. Furthermore, IMDB-M includes more movies than IMDB-B. We can thus expect IMDB-M to exhibit a more prominent presence of hubs than IMDB-B, due to the "rich get richer" phenomenon (i.e., as the number of movies considered grows, more famous actors have a greater probability of being featured in them and thus of receiving more connections). This is reflected in the data in Table 7: the best value for IMDB-M is obtained with p = 80 (i.e., the top 20% of nodes are considered hubs), whereas the best value for IMDB-B is obtained with p = 90 (i.e., the top 10% are considered hubs). More generally, Table 7 motivates the usage of the hub-promotion parameter p, as on all three datasets we obtain benefits by setting it to a non-trivial value (we recall that p = 100 corresponds to not using hub promotion).
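The percentile threshold can be illustrated with a minimal sketch, assuming hub-promotion simply flags nodes whose degree reaches the p-th percentile of the degree distribution; `hub_mask` is a hypothetical helper for illustration, not KPlexPool's actual implementation.

```python
import numpy as np

def hub_mask(degrees, p):
    """Flag as hubs the nodes whose degree is at or above the p-th
    percentile of the degree distribution, so that p = 80 promotes
    (roughly) the top 20% of nodes, while p = 100 disables promotion."""
    degrees = np.asarray(degrees)
    if p >= 100:
        return np.zeros(degrees.shape, dtype=bool)
    return degrees >= np.percentile(degrees, p)

deg = [1, 2, 2, 3, 10, 12]
print(hub_mask(deg, 80).tolist())  # -> [False, False, False, False, True, True]
```

In this toy degree sequence, lowering p from 100 to 80 promotes exactly the two high-degree nodes, mirroring how a non-trivial p isolates hubs in social graphs.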
As for varying the k values, we do not observe any significant trend. This is perhaps unsurprising, considering that these datasets are obtained by turning the co-authors of a paper (resp., the co-stars of a movie) into a clique, so k = 1 may already be general enough in some cases; we observe, however, that the best results appear on different rows for different values of p, and in particular the best value on IMDB-M is found for k = 4.
For an optimal choice of parameters, the ideal strategy is to select the most effective value of k through the hyper-parameter tuning process. However, we observe that k = 2 has good overall performance, often being the best result or close to it: when the approach must be used on the fly, as a rule of thumb we suggest trying low values of k.

Conclusions
We have introduced KPlexPool, a novel graph pooling methodology leveraging k-plexes and graph covers. Starting from consolidated graph-theoretic concepts and recent results on scalable community detection (Conte et al. 2018), we have built a flexible graph reduction approach that works effectively across structures with different topological properties. We have provided a general formulation that can account for different node inspection schedules and that can, in principle, be tailored based on prior or domain knowledge. Nevertheless, in this paper we assessed the approach with a fixed node ordering heuristic and cover post-processing strategy, showing the effectiveness of the method even in its plainest configuration.
The resulting KPlexPool algorithm achieves state-of-the-art performance on 7 out of 9 graph classification benchmarks, and is shown to be the best performing method, on average, when compared with related pooling mechanisms from the literature. It does so through a fully topological approach that does not leverage task information for community building. None of the related models, including the adaptive ones, seems able to cope as effectively with structures of radically different nature (molecules and social networks). Apart from predictive performance, KPlexPool has a very practical advantage in terms of computational cost when compared to adaptive models such as DiffPool: since its graph reduction does not depend on adaptive node embeddings, it can be pre-computed once for the whole dataset and re-used throughout the whole model selection and validation phase. This aspect is clear from the empirical results, in which DiffPool fails to complete training within the 2-week limit (or exceeds the available GPU memory) on datasets comprising larger graphs. Conversely, since the proposed k-plex cover algorithm is mostly sequential, computing KPlexPool on the fly during the training loop produces an overhead due to GPU-CPU synchronizations. Overcoming this limitation requires designing a parallel alternative to our algorithm that can also run on GPU, which we leave as future work.
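The precompute-once pattern described above can be sketched as follows. This is only an illustration of the caching idea: `cover_fn` is a hypothetical stand-in for the actual k-plex cover routine, and the toy edge-counting "pooling" exists solely to make the sketch runnable.

```python
def precompute_pools(graphs, cover_fn, k=2):
    """Compute each graph's topological coarsening a single time; since
    the result depends only on topology (not on learned embeddings), the
    cached structures can be reused across every training epoch and
    model-selection run."""
    return {gid: cover_fn(g, k) for gid, g in graphs.items()}

# Toy stand-in: "pool" a graph given as an edge list by counting its
# edges (placeholder for the real k-plex cover computation).
toy = {"g1": [(0, 1), (1, 2)], "g2": [(0, 1)]}
pools = precompute_pools(toy, lambda g, k: len(g))
print(pools)  # -> {'g1': 2, 'g2': 1}
```

The design point is that the expensive, sequential cover computation is moved outside the training loop entirely, avoiding the GPU-CPU synchronization overhead mentioned above.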