1 Introduction

In the last few years, the definition of machine learning methods, particularly neural networks, for graph-structured inputs has been gaining increasing attention in the literature (Defferrard et al. 2016; Errica et al. 2020). In particular, graph convolutional networks (GCNs), based on the definition of a convolution operator in the graph domain, are relatively fast to compute and have shown good predictive performance. Graph convolutions (GC) are generally based on a neighborhood aggregation scheme (Gilmer et al. 2017) that considers, for each node, only its direct neighbors. By stacking multiple GC layers, the receptive field of deeper filters increases (resembling standard convolutional networks). However, stacking too many GC layers may be detrimental to the network's ability to represent meaningful topological information (Li et al. 2018) because of excessive Laplacian smoothing. Moreover, in this way interactions among GC parameters at different layers bias the flow of topological information, as we will discuss in this paper. For these reasons, several convolution operators have been defined in the literature, differing from one another in the considered aggregation scheme. We argue that the performance of GC networks could benefit from increasing the size of the receptive fields, but since with existing GC architectures this effect can only be obtained by stacking more GC layers, the increased difficulty in training and the limited expressiveness caused by stacking many local layers end up hurting their predictive capabilities.

Consequently, the performance of existing GCNs is strongly dependent on the specific architecture. In summary, the performance of existing graph neural networks is limited by (i) the necessity to select a specific convolution operator, and (ii) the limited expressiveness caused by large receptive fields only being achievable by stacking many local layers.

In this paper, we tackle both issues following a different strategy. We propose the polynomial graph convolution (PGC) layer, which independently considers neighbouring nodes at different topological distances (i.e. arbitrarily large receptive fields). We show that the PGC layer is more general than most convolution operators in the literature. As for the second issue, a single PGC layer, directly considering larger receptive fields, can represent a richer set of functions compared to the linear stacking of two or more graph convolution layers, i.e. it is more expressive. Moreover, the linear PGC design makes it possible to consider large receptive fields without incurring the typical issues related to training deep networks. We developed the polynomial graph convolutional network (PGCN), an architecture that exploits the PGC layer to perform graph classification tasks. We empirically evaluate the proposed PGCN on eight commonly adopted graph classification benchmarks. We compare the proposed method to several state-of-the-art GCNs, consistently achieving higher or comparable predictive performance. Differently from other works in the literature, the contribution of this paper is to show that the common approach of stacking multiple GC layers may not provide an optimal exploitation of topological information, because of the strong coupling of the depth of the network with the size of the topological receptive fields. In our proposal, the depth of the PGCN is decoupled from the receptive field size, making it possible to build deep GNNs while avoiding the oversmoothing problem.

2 Notation

We use italic letters to refer to variables, bold lowercase to refer to vectors, and bold uppercase letters to refer to matrices. The elements of a matrix \({\mathbf {A}}\) are referred to as \(a_{ij}\) (and similarly for vectors). We use uppercase letters to refer to sets or tuples. Let \(G=(V,E,{\mathbf {X}})\) be a graph, where \(V=\{v_0, \ldots , v_{n-1}\}\) denotes the set of vertices (or nodes) of the graph, \(E \subseteq V \times V\) is the set of edges, and \({\mathbf {X}} \in {\mathbb {R}}^{n\times s}\) is a multivariate signal on the graph nodes with the i-th row representing the attributes of \(v_i\). We define \({\mathbf {A}} \in {\mathbb {R}}^{n \times n}\) as the adjacency matrix of the graph, with elements \(a_{ij}=1 \iff (v_i,v_j)\in E\). With \({\mathcal {N}}(v)\) we denote the set of nodes adjacent to node v. Let also \({\mathbf {D}} \in {\mathbb {R}}^{n \times n}\) be the diagonal degree matrix where \(d_{ii}=\sum _j a_{i j}\), and \({\mathbf {L}}\) the normalized graph Laplacian defined by \({\mathbf {L}} = {\mathbf {I}}- {\mathbf {D}}^{-\frac{1}{2}}{\mathbf {A}}{\mathbf {D}}^{-\frac{1}{2}}\), where \({\mathbf {I}}\) is the identity matrix.

With \(GConv_{\theta }({\mathbf {x}}_v,G)\) we denote a graph convolution with set of parameters \(\theta\). A GCN with k levels of convolutions is denoted as \(GConv_{\theta _k}(\ldots GConv_{\theta _1}({\mathbf {x}}_v,G)\ldots ,G)\). For a discussion about the most common GCNs we refer to “Appendix A”. We indicate with \(\hat{{\mathbf {X}}}\) the input representation fed to a layer, where \(\hat{{\mathbf {X}}}={{\mathbf {X}}}\) if we are considering the first layer of the graph convolutional network, or \(\hat{{\mathbf {X}}}={\mathbf {H}}^{(i-1)}\) if considering the i-th graph convolution layer.

3 Background

The derivation of the graph convolution operator originates from graph spectral filtering (Defferrard et al. 2016). In order to set up a convolutional network on G, we need the notion of a convolution \(*_G\) between a signal \({\mathbf {x}}\) and a filter signal \({\mathbf {f}}\). The Spectral convolution (Defferrard et al. 2016) can be considered the application of the Fourier transform to graphs. It is obtained via Chebyshev polynomials of the Laplacian matrix. In general, the usage of a Chebyshev basis improves numerical stability; the basis is defined recursively as:

$$\begin{aligned} T^{(0)}(x)=1, T^{(1)}(x)=x, T^{(k)}(x)=2xT^{(k-1)}(x)- T^{(k-2)}(x). \end{aligned}$$

A graph filter can be defined as:

$$\begin{aligned} \hat{{\mathbf {F}}}_{\varvec{\theta }} = \sum _{i=0}^{k^\star } \theta _i T^{(i)}(\tilde{\mathbf {\Lambda }}), \end{aligned}$$
(1)

where \(\tilde{\mathbf {\Lambda }}=\frac{2\mathbf {\Lambda }}{\lambda _{max}}-{\mathbf {I}}_n\) is the diagonal matrix of scaled eigenvalues of the graph Laplacian. The resulting convolution is then:

$$\begin{aligned} {\mathbf {f}}_{\varvec{\theta }} *_G {\mathbf {x}} = \sum _{i=0}^{k^\star } \theta _i T^{(i)}(\tilde{{\mathbf {L}}}) {\mathbf {x}}, \end{aligned}$$
(2)

where \(\tilde{{\mathbf {L}}}=\frac{2{\mathbf {L}}}{\lambda _{max}}-{\mathbf {I}}_n\).
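As an illustration, the following minimal sketch (our code, not the authors' implementation) applies the Chebyshev filter of Eq. (2) to a dense node signal, assuming a symmetric adjacency matrix and using the recursion above so that each \(T^{(i)}(\tilde{{\mathbf {L}}}){\mathbf {x}}\) is computed without forming the polynomial explicitly.

```python
# A minimal sketch (ours) of the Chebyshev spectral filter in Eq. (2);
# A is a dense symmetric adjacency matrix, x a node signal, theta the filter
# coefficients (order k = len(theta) - 1). Illustrative only.
import torch

def chebyshev_filter(A: torch.Tensor, x: torch.Tensor, theta) -> torch.Tensor:
    n = A.shape[0]
    d_inv_sqrt = torch.diag(A.sum(dim=1).clamp(min=1e-12).pow(-0.5))
    L = torch.eye(n) - d_inv_sqrt @ A @ d_inv_sqrt        # normalized Laplacian
    lam_max = torch.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - torch.eye(n)            # rescaled Laplacian
    T_prev, T_curr = x, L_tilde @ x                       # T^(0) x and T^(1) x
    out = theta[0] * T_prev + (theta[1] * T_curr if len(theta) > 1 else 0.0)
    for i in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev  # recursion
        out = out + theta[i] * T_curr
    return out
```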

The graph convolution (Kipf and Welling 2017) (GCN) is a simplification of the spectral convolution. The authors propose to fix the order \(k^\star =1\) of the Chebyshev spectral convolution in Eq. (2) to obtain a linear first order graph convolution filter. These simple convolutions can then be stacked in order to improve the discriminatory power of the resulting network. The resulting convolution operator in Kipf and Welling (2017) is defined as:

$$\begin{aligned} {\mathbf {H}}^{(i+1)} = (\tilde{{\mathbf {D}}}^{-\frac{1}{2}} \tilde{{\mathbf {A}}}\tilde{{\mathbf {D}}}^{-\frac{1}{2}}) {\mathbf {H}}^{(i)} {\mathbf {W}}^i, \end{aligned}$$
(3)

where \(\tilde{{\mathbf {A}}}= \mathbf {I_n} + {\mathbf {A}}\), \(\mathbf {I_n}\) is the \(n \times n\) identity matrix, \(\tilde{{\mathbf {D}}}\) is a diagonal matrix with entries \({\tilde{d}}_{ii}=\sum _j {\tilde{a}}_{i j}\), and \({\mathbf {H}}^{(0)}={\mathbf {X}}\).
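For reference, a single propagation step of Eq. (3) can be sketched with dense tensors as follows (our naming; no nonlinearity is applied, exactly as in the equation).

```python
# A minimal sketch of the GCN propagation rule in Eq. (3); A, H, W are dense
# placeholders for the adjacency matrix, the current node representations,
# and the layer weights.
import torch

def gcn_layer(A: torch.Tensor, H: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    A_tilde = A + torch.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))  # D_tilde^{-1/2}
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt @ H @ W       # Eq. (3)
```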

Morris et al. (2019) introduced the GraphConv operator, inspired by the Weisfeiler-Lehman graph invariant, defined as follows:

$$\begin{aligned} {\mathbf {H}}^{(i+1)}={\mathbf {H}}^{(i)}\bar{{\mathbf {W}}}^{(i)} + {\mathbf {A}}{\mathbf {H}}^{(i)}\hat{{\mathbf {W}}}^{(i)}. \end{aligned}$$
(4)

Xu et al. (2019) defined the graph isomorphism network (GIN) convolution, in which the aggregation over node neighbors is followed by an MLP. The resulting GC formulation is the following:

$$\begin{aligned} {\mathbf {H}}^{(i+1)} = MLP ( (1+\epsilon ){\mathbf {H}}^{(i)} + {\mathbf {A}}{\mathbf {H}}^{(i)}). \end{aligned}$$
(5)

4 Polynomial graph convolution (PGC)

In this section, we introduce the polynomial graph convolution (PGC), able to simultaneously and directly consider all topological receptive fields up to k hops, just like the ones obtained by the graph convolutional layers in a stack of size k. PGC, however, does not incur the typical limitations related to the complex interaction among the parameters of the GC layers. Actually, we show that PGC is more expressive than the most common convolution operators. Moreover, we prove that a single PGC convolution of order k is capable of implementing k linearly stacked layers of convolutions proposed in the literature, while also providing additional functions that cannot be realized by the stack. Thus, the PGC layer extracts topological information from the input graph, effectively decoupling the depth of the network from the size of the receptive field. Its combination with deep MLPs makes it possible to obtain deep graph neural networks that overcome the common oversmoothing problem of current architectures. The basic idea underpinning the definition of PGC is to consider the case in which the graph convolution can be expressed as a polynomial of the powers of a transformation \({{{\mathcal {T}}}}\) of the adjacency matrix. This definition is very general, and thus it incorporates many existing graph convolutions as special cases. Given a graph \(G = (V,E,{\mathbf {X}})\) with adjacency matrix \({\mathbf {A}}\), the polynomial graph convolution (PGC) layer of degree k, transformation \({{{\mathcal {T}}}}\) of \({\mathbf {A}}\), and size m, is defined as

$$\begin{aligned} PGConv_{k,{{{\mathcal {T}}}},m}({\mathbf {X}},{\mathbf {A}}) = {\mathbf {R}}_{k,{{{\mathcal {T}}}}}{\mathbf {W}}, \end{aligned}$$
(6)

where \({{{\mathcal {T}}}}: \bigcup _{j=1}^\infty ({\mathbb {R}}^{j \times j} \rightarrow {\mathbb {R}}^{j \times j})\) is a generic transformation of the adjacency matrix that preserves its shape, i.e. \(\mathcal{T}({\mathbf {A}})\in {\mathbb {R}}^{n \times n}\). For instance, \({{{\mathcal {T}}}}\) can be defined as the function returning the Laplacian matrix starting from the adjacency matrix. Moreover, \({\mathbf {R}}_{k,\mathcal{T}} \in {\mathbb {R}}^{n \times s(k+1)}\) is defined as

$$\begin{aligned} {\mathbf {R}}_{k,{{{\mathcal {T}}}}} =[ {\mathbf {X}}, {{{\mathcal {T}}}}({\mathbf {A}}) {\mathbf {X}},{{{\mathcal {T}}}}({\mathbf {A}})^2 {\mathbf {X}},\ldots , \mathcal{T}({\mathbf {A}})^k {\mathbf {X}}], \end{aligned}$$

and \({\mathbf {W}} \in {\mathbb {R}}^{s(k+1) \times m}\) is a learnable weight matrix. For the sake of presentation, we will consider \({\mathbf {W}}\) as composed of blocks: \({\mathbf {W}}= [{\mathbf {W}}_0,\ldots ,{\mathbf {W}}_k]^\top\), with \({\mathbf {W}}_j \in {\mathbb {R}}^{s \times m}\). In the following, we show that PGC is very expressive, able to implement commonly used convolutions as special cases.
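The definition above translates directly into code. The following sketch (our naming and a dense-tensor assumption, not the released implementation) builds \({\mathbf {R}}_{k,{{{\mathcal {T}}}}}\) and applies the learnable weight matrix \({\mathbf {W}}\) of Eq. (6); the transformed adjacency matrix \({{{\mathcal {T}}}}({\mathbf {A}})\) is assumed to be pre-computed and passed in as TA.

```python
# A minimal sketch of the PGC layer of Eq. (6); `build_R` and `PGConv` are our
# illustrative names, and T(A) is passed pre-computed as a dense tensor `TA`.
import torch

def build_R(X: torch.Tensor, TA: torch.Tensor, k: int) -> torch.Tensor:
    """Concatenate [X, T(A)X, T(A)^2 X, ..., T(A)^k X] along the feature dimension."""
    blocks, cur = [X], X
    for _ in range(k):
        cur = TA @ cur                        # next power of T(A) applied to X
        blocks.append(cur)
    return torch.cat(blocks, dim=1)           # shape: n x s(k+1)

class PGConv(torch.nn.Module):
    def __init__(self, s: int, k: int, m: int):
        super().__init__()
        self.k = k
        self.W = torch.nn.Parameter(torch.empty(s * (k + 1), m))
        torch.nn.init.xavier_uniform_(self.W)

    def forward(self, X: torch.Tensor, TA: torch.Tensor) -> torch.Tensor:
        return build_R(X, TA, self.k) @ self.W   # Eq. (6)
```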

4.1 Graph convolutions in literature as PGC instantiations

The PGC layer in (6) is designed to be a generalization of the linear stacking of some of the most common spatially localized graph convolutions. The idea is that spatially localized convolutions aggregate over neighbors (the message passing phase) using a transformation of the adjacency matrix (e.g. a normalized graph Laplacian). We provide in this section a formal proof, as a theoretical contribution of this paper, that linearly stacked convolutions can be rewritten as polynomials of powers of the transformed adjacency matrix.

We start by showing how common graph convolution operators can be defined as particular instances of a single PGC layer (in most cases with \(k=1\)). Then, we prove that linearly stacking any two PGC layers produces a convolution that can be written as a single PGC layer as well.

Spectral A layer of Spectral convolutions of order \(k^\star\) as defined in Eq. (2) can be implemented by a single PGC layer by instantiating \({{{\mathcal {T}}}}(A)\) as the graph Laplacian matrix (or one of its normalized versions), setting the PGC k value to \(k^\star\), and setting the weight matrix to encode the constraints given by the Chebyshev polynomials. For instance, we can get the output \({\mathbf {H}}\) of a Spectral convolution layer with \(k^\star =3\) by the following PGC:

$$\begin{aligned} {\mathbf {H}}&=[ {\mathbf {X}}, {\mathbf {L}} {\mathbf {X}}, {\mathbf {L}}^2 {\mathbf {X}}, {\mathbf {L}}^3 {\mathbf {X}}]{\mathbf {W}}, \text { where } {\mathbf {W}}= \left[ {\begin{array}{c} {\mathbf {W}}_0 - {\mathbf {W}}_2 \\ {\mathbf {W}}_1 -3{\mathbf {W}}_3 \\ 2{\mathbf {W}}_2\\ 4{\mathbf {W}}_3 \end{array} } \right] ,\ \ {\mathbf {W}}_{i} \in {\mathbb {R}}^{s \times m}. \end{aligned}$$
(7)

GCN The convolution defined in Eq. (3) can be obtained in the PGC framework by setting \(k=1\) and \({{{\mathcal {T}}}}(A)=\tilde{ {\mathbf {D}}}^{-\frac{1}{2}} \tilde{{\mathbf {A}}}\tilde{{\mathbf {D}}}^{-\frac{1}{2}} \in {\mathbb {R}}^{n \times n}\). We obtain the following equivalent equation:

$$\begin{aligned} {\mathbf {H}}&=[ {\mathbf {X}}, \tilde{ {\mathbf {D}}}^{-\frac{1}{2}} \tilde{{\mathbf {A}}}\tilde{{\mathbf {D}}}^{-\frac{1}{2}} {\mathbf {X}}] {\mathbf {W}}, \text { where } {\mathbf {W}}= \left[ {\begin{array}{c} {\mathbf {0}} \\ {\mathbf {W}}_1 \\ \end{array} } \right] , \end{aligned}$$
(8)

where \({\mathbf {0}}\) is an \(s \times m\) matrix with all entries equal to zero and \({\mathbf {W}}_{1} \in {\mathbb {R}}^{s \times m}\) is the weight matrix of GCN. Note that GCN does not treat a node differently from its neighbors; thus, in this case there is no contribution from the first component of \({\mathbf {R}}_{k,{{{\mathcal {T}}}}}\).

GraphConv The convolution defined in Eq. (4) can be obtained by setting \({{{\mathcal {T}}}}(A)=A\) (the identity function), with k again set to 1. A single GraphConv layer can be written as:

$$\begin{aligned} {\mathbf {H}}&=[ {{\mathbf {X}}}, {\mathbf {A}} {{\mathbf {X}}}] {\mathbf {W}}, \text { where } {\mathbf {W}}= \left[ {\begin{array}{c} {\mathbf {W}}_0 \\ {\mathbf {W}}_1 \\ \end{array} } \right] ,\text { and } {\mathbf {W}}_{0},{\mathbf {W}}_{1} \in {\mathbb {R}}^{s \times m}. \end{aligned}$$
(9)

GIN Technically, the GIN presented in Eq. (5) is a composition of a convolution (a linear operator) with a multi-layer perceptron. Let us thus decompose the MLP as \(f\circ g\), where g is an affine projection via the weight matrix \({\mathbf {W}}\), and f incorporates the element-wise non-linearity and the other layers of the MLP. We can thus isolate the GIN graph convolution component and define it as a specific PGC instantiation. We set \(k=1\) and let \(\mathcal{T}()\) be the identity function, as before. A single GIN layer can then be obtained as:

$$\begin{aligned} {\mathbf {H}}&=[ {\mathbf {X}}, {\mathbf {A}} {\mathbf {X}}] {\mathbf {W}}, \text { where } {\mathbf {W}}= \left[ {\begin{array}{c} (1+\epsilon ){\mathbf {W}}_1 \\ {\mathbf {W}}_1 \\ \end{array} } \right] . \end{aligned}$$
(10)

Note that, differently from GraphConv, in this case the blocks of the matrix \({\mathbf {W}}\) are tied. Figure 3 in “Appendix B” depicts the expressivity of different graph convolution operators in terms of the respective constraints on the weight matrix \({\mathbf {W}}\). The comparison is made easy by the definition of the different graph convolution layers as instances of PGC layers. Actually, it is easy to see from Eqs. (8)–(10) that GraphConv is more expressive than GCN and GIN.

4.2 Linearly stacked graph convolutions as PGC instantiations

In the previous section, we have shown that common graph convolutions can be expressed as particular instantiations of a PGC layer. In this section, we show that a single PGC layer can model the linear stacking of any number of PGC layers (using the same transformation \({\mathcal {T}}\)). Thus, a single PGC layer can model all the functions computed by arbitrarily many linearly stacked graph convolution layers defined in the previous section. We then show that a PGC layer also includes additional functions compared to the stacking of simpler PGC layers, which makes it more expressive.

Theorem 1

Let us consider two linearly stacked PGC layers using the same transformation \({\mathcal {T}}\). The resulting linear Graph Convolutional network can be expressed by a single PGC layer.

Due to space limitations, the proof is reported in “Appendix C”. Here it suffices to note that the proof of Theorem 1 shows that a single PGC of order k can represent the linear stacking of any q (\({\mathcal {T}}\)-compatible) convolutions such that \(k=\sum _{i=1}^{q} d_i\), where \(d_i\) is the degree of the convolution at level i. We will now show that a single PGC layer can also represent other functions, i.e. it is more general than the stacking of existing convolutions. Let us consider, for the sake of simplicity, the stacking of 2 PGC layers with \(k=1\) [which are equivalent to GraphConv layers, see Eq. (9)], each with parameters \({\mathbf {W}}^{(i)}= [{\mathbf {W}}_0^{(i)},{\mathbf {W}}_1^{(i)}]^\top , \ i=1,2\), \({\mathbf {W}}^{(1)}_0, {\mathbf {W}}^{(1)}_1 \in {\mathbb {R}}^{s\times m_1},\ {\mathbf {W}}^{(2)}_0, {\mathbf {W}}^{(2)}_1 \in {\mathbb {R}}^{m_1\times m_2}\). The same reasoning can be applied to any other convolution among the ones presented in Sect. 4.1. We can explicitly write the equations computing the hidden representations:

$$\begin{aligned} {\mathbf {H}}^{(1)}&={{\mathbf {X}}}{\mathbf {W}}^{(1)}_{0} + {\mathbf {A}}{{\mathbf {X}}}{\mathbf {W}}^{(1)}_{1}, \end{aligned}$$
(14)
$$\begin{aligned} {\mathbf {H}}^{(2)}&={\mathbf {H}}^{(1)}{\mathbf {W}}^{(2)}_{0} + {\mathbf {A}}{\mathbf {H}}^{(1)}{\mathbf {W}}^{(2)}_{1}\nonumber \\&={{\mathbf {X}}}{\mathbf {W}}^{(1)}_{0}{\mathbf {W}}^{(2)}_{0} +{\mathbf {A}}{{\mathbf {X}}}({\mathbf {W}}^{(1)}_{1}{\mathbf {W}}^{(2)}_{0} + {\mathbf {W}}^{(1)}_{0}{\mathbf {W}}^{(2)}_{1})+ {\mathbf {A}}^2{{\mathbf {X}}}{\mathbf {W}}^{(1)}_{1}{\mathbf {W}}^{(2)}_{1}. \end{aligned}$$
(15)

A single PGC layer can implement this second order convolution as:

$$\begin{aligned} {\mathbf {H}}^{(2)}=[{\mathbf {X}},{\mathbf {A}}{\mathbf {X}}, {\mathbf {A}}^2{\mathbf {X}}] \left[ {\begin{array}{c} {\mathbf {W}}^{(1)}_{0}{\mathbf {W}}^{(2)}_{0}\\ {\mathbf {W}}^{(1)}_{1}{\mathbf {W}}^{(2)}_{0} + {\mathbf {W}}^{(1)}_{0}{\mathbf {W}}^{(2)}_{1} \\ {\mathbf {W}}^{(1)}_{1}{\mathbf {W}}^{(2)}_{1} \\ \end{array} } \right] . \end{aligned}$$
(16)

Let us compare it with a PGC layer that corresponds to the same 2-layer architecture but that has no constraints on the weight matrix, i.e.:

$$\begin{aligned} {\mathbf {H}}^{(2)}=[{\mathbf {X}},{\mathbf {A}}{\mathbf {X}}, {\mathbf {A}}^2{\mathbf {X}}] \left[ {\begin{array}{c} {\mathbf {W}}_{0}\\ {\mathbf {W}}_{1} \\ {\mathbf {W}}_{2} \\ \end{array} } \right] ,\ {\mathbf {W}}_{i}\in {\mathbb {R}}^{s\times m_2},\ i=0,1,2. \end{aligned}$$
(17)

Even though it is not obvious at first glance, (16) is more constrained than (17), i.e. there are some values of \({\mathbf {W}}_{0},{\mathbf {W}}_{1},{\mathbf {W}}_{2}\) in (17) that cannot be obtained for any \({\mathbf {W}}^{(1)} = [{\mathbf {W}}^{(1)}_0, {\mathbf {W}}^{(1)}_1]^\top\) and \({\mathbf {W}}^{(2)}=[{\mathbf {W}}^{(2)}_0,{\mathbf {W}}^{(2)}_1]^\top\) in (16), as proven by the following theorem.

Theorem 2

A PGC layer with \(k=2\) is more general than two stacked PGC layers with \(k=1\) with the same number of hidden units m.

We refer the reader to “Appendix C” for the proof. Notice that the GraphConv layer is equivalent to a PGC layer with \(k=1\) (if no constraints on \({\mathbf {W}}\) are considered, see later in this section). Since GraphConv is, in turn, more general than GCN and GIN, the above theorem also holds for those graph convolutions. Moreover, Theorem 2 trivially implies that a linear stack of q PGC layers with \(k=1\) is less expressive than a single PGC layer with \(k=q\). In “Appendices D and E” we characterize the hypotheses that cannot be represented with the stacking approach, and we give examples that offer more practical insight into such hypotheses and into the reason why they are not representable with stacking.
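The algebra behind Eqs. (14)–(16) can also be checked numerically. The following sketch (our code, random data) verifies that two stacked \(k=1\) layers coincide with a single \(k=2\) PGC layer whose weight matrix is built according to the constraints in Eq. (16).

```python
# A sketch (ours, random data) checking that the stacked form of Eq. (15)
# matches the constrained single-layer form of Eq. (16).
import torch

torch.manual_seed(0)
n, s, m1, m2 = 6, 4, 5, 3
A = torch.randint(0, 2, (n, n)).float()
A = torch.triu(A, 1); A = A + A.T                          # random symmetric adjacency
X = torch.randn(n, s)
W10, W11 = torch.randn(s, m1), torch.randn(s, m1)
W20, W21 = torch.randn(m1, m2), torch.randn(m1, m2)

H1 = X @ W10 + A @ X @ W11                                 # Eq. (14)
H2_stack = H1 @ W20 + A @ H1 @ W21                         # Eq. (15)

R = torch.cat([X, A @ X, A @ A @ X], dim=1)                # [X, AX, A^2 X]
W = torch.cat([W10 @ W20, W11 @ W20 + W10 @ W21, W11 @ W21], dim=0)
H2_pgc = R @ W                                             # Eq. (16)
assert torch.allclose(H2_stack, H2_pgc, atol=1e-5)
```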

If we now consider that in many GCN architectures it is typical, and useful, to concatenate the output of all convolution layers before aggregating the node representations, then it is not difficult to see that such concatenation can be obtained by widening the weight matrix of PGC. Let us thus consider a network that generates a hidden representation that is the concatenation of the different representations computed at each layer, i.e. \({\mathbf {H}}=[{\mathbf {H}}^{(1)},{\mathbf {H}}^{(2)}]\in {\mathbb {R}}^{n\times m},\ m=m_1+m_2\). We can represent a two-layer GraphConv network as a single PGC layer as:

$$\begin{aligned} {\mathbf {H}}=[{\mathbf {X}},{\mathbf {A}}{\mathbf {X}}, {\mathbf {A}}^2{\mathbf {X}}] \left[ \begin{array}{cc} {\mathbf {W}}^{(1)}_{0} &{}{\mathbf {W}}^{(1)}_{0}{\mathbf {W}}^{(2)}_{0}\\ {\mathbf {W}}^{(1)}_{1} &{} {\mathbf {W}}^{(1)}_{1}{\mathbf {W}}^{(2)}_{0} + {\mathbf {W}}^{(1)}_{0}{\mathbf {W}}^{(2)}_{1} \\ {\mathbf {0}} &{} {\mathbf {W}}^{(1)}_{1}{\mathbf {W}}^{(2)}_{1} \\ \end{array} \right] . \end{aligned}$$
(18)

More generally, if we consider k GraphConv convolutional layers (see Eq. (9)), each with parameters \({\mathbf {W}}^{(i)}= [{\mathbf {W}}_0^{(i)},{\mathbf {W}}_1^{(i)}]^\top\), \(i=1,\ldots ,k,\ {\mathbf {W}}_0^{(i)},{\mathbf {W}}_1^{(i)}\in {\mathbb {R}}^{m_{i-1}\times m_i},\ m_0=s,\ m = \sum _{j=1}^k m_j\), the weight matrix \({\mathbf {W}}\in {\mathbb {R}}^{s\cdot (k+1)\times m}\) can be defined as follows:

$$\begin{aligned} \left[ {\begin{array}{cccc} F_{0,1}({\mathbf {W}}^{(1)}) &{} F_{0,2}({\mathbf {W}}^{(1)},{\mathbf {W}}^{(2)} ) &{} F_{0,3}({\mathbf {W}}^{(1)},{\mathbf {W}}^{(2)}, {\mathbf {W}}^{(3)} ) &{}\ldots \\ F_{1,1}({\mathbf {W}}^{(1)}) &{} F_{1,2}({\mathbf {W}}^{(1)},{\mathbf {W}}^{(2)} ) &{} F_{1,3}({\mathbf {W}}^{(1)},{\mathbf {W}}^{(2)},{\mathbf {W}}^{(3)} )&{}\ldots \\ {\mathbf {0}} &{} F_{2,2}({\mathbf {W}}^{(1)},{\mathbf {W}}^{(2)} ) &{} F_{2,3}({\mathbf {W}}^{(1)},{\mathbf {W}}^{(2)},{\mathbf {W}}^{(3)} )&{}\ldots \\ {\mathbf {0}} &{} {\mathbf {0}} &{} F_{3,3}({\mathbf {W}}^{(1)},{\mathbf {W}}^{(2)},{\mathbf {W}}^{(3)} )&{}\ldots \\ \ldots &{} \ldots &{} \ldots &{}\ldots \\ \end{array} } \right] , \end{aligned}$$
(19)

where \(F_{i,j}(),\ i,j \in \{0,\ldots ,k\} ,\ i\le j\), are defined as

$$\begin{aligned} F_{i,j}({\mathbf {W}}^{(1)},\ldots ,{\mathbf {W}}^{(j)}) = \sum _{{\mathop {s.t.\ \sum _{q=1}^{j}z_q = i}\limits ^{(z_1,..,z_j)\in \{0,1\}^j}}} \ \prod _{s=1}^{j}{\mathbf {W}}_{z_s}^{(s)}. \end{aligned}$$

We can now generalize this formulation by concatenating the output of \(k+1\) PGC convolutions of degree ranging from 0 up to k. This gives rise to the following definitions:

$$\begin{aligned} {\mathbf {W}}= \left[ {\begin{array}{ccccc} {\mathbf {W}}_{0,0} &{} {\mathbf {W}}_{0,1} &{} {\mathbf {W}}_{0,2} &{} \dots &{} {\mathbf {W}}_{0,k}\\ {\mathbf {0}} &{} {\mathbf {W}}_{1,1} &{} {\mathbf {W}}_{1,2}&{} \dots &{} {\mathbf {W}}_{1,k}\\ {\mathbf {0}} &{} {\mathbf {0}} &{} {\mathbf {W}}_{2,2} &{} \dots &{} {\mathbf {W}}_{2,k}\\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ {\mathbf {0}}&{} {\mathbf {0}} &{} {\mathbf {0}} &{} \dots &{} {\mathbf {W}}_{k,k}\\ \end{array} } \right] , \ {\mathbf {H}} =\left[ {\begin{array}{c} ({\mathbf {X}} {\mathbf {W}}_{0,0})^\top \\ ({\mathbf {X}} {\mathbf {W}}_{0,1} + {\mathcal {T}}({\mathbf {A}}) {\mathbf {X}} {\mathbf {W}}_{1,1})^\top \\ \vdots \\ ({\mathbf {X}}{\mathbf {W}}_{0,k}+ \dots +{\mathcal {T}}({\mathbf {A}})^{k} {\mathbf {X}}{\mathbf {W}}_{k,k})^\top \\ \end{array} } \right] ^\top , \end{aligned}$$
(20)

where we impose no constraints among the matrices \({\mathbf {W}}_{i,j}\in {\mathbb {R}}^{s\times m_j},\ m =\sum _{j=0}^{k}m_j\), which are considered mutually independent. Note that, as a consequence of Theorem 2, the network defined in (20) is more expressive than the one obtained by concatenating different GraphConv layers as defined in (19). It can also be noted that the same network can actually be seen as a single PGC layer of order \(k+1\) with a constraint on the weight matrix (i.e., to be an upper triangular block matrix). Of course, any weight-sharing policy can be easily implemented, e.g. by imposing \(\forall j\ {\mathbf {W}}_{i,j}= {\mathbf {W}}_{i}\), which corresponds to the concatenation of the representations obtained at level i by a single stack of convolutions. In addition to reducing the number of free parameters, this weight-sharing policy also reduces the computational burden, since the representation at level i is obtained by adding to the representation at level \(i-1\) the contribution of matrix \({\mathbf {W}}_{i}\), i.e. \({\mathbf {A}}^i{\mathbf {X}}{\mathbf {W}}_{i}\).
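For concreteness, the block upper triangular weight matrix of Eq. (20) can be sketched as follows (our code; the block sizes and the use of random values for the free blocks are purely illustrative).

```python
# A sketch of the block upper triangular W of Eq. (20): blocks W_{i,j} with
# i <= j are free (here random, in practice learnable), the others are zero.
import torch

def block_upper_triangular_W(s: int, k: int, m_cols) -> torch.Tensor:
    """m_cols[j] is the number of hidden units of the j-th concatenated output."""
    assert len(m_cols) == k + 1
    columns = []
    for j, mj in enumerate(m_cols):
        rows = [torch.randn(s, mj) if i <= j else torch.zeros(s, mj)
                for i in range(k + 1)]
        columns.append(torch.cat(rows, dim=0))             # stack the k+1 row blocks
    return torch.cat(columns, dim=1)                       # shape: s(k+1) x sum(m_cols)

# H = build_R(X, TA, k) @ block_upper_triangular_W(s, k, m_cols)   # cf. Eq. (20)
```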

4.3 Computational complexity

As detailed in the previous discussion, the degree k of a PGC layer controls the size of its receptive field. In terms of the number of parameters, fixing the node attribute size s and the size m of the hidden representation, the number of parameters of the PGC is \(O(s \cdot k \cdot m)\), i.e. it grows linearly in k. Thus, the number of parameters of a PGC layer is of the same order of magnitude as that of k stacked graph convolution layers based on message passing (Gilmer et al. 2017) (i.e. GraphConv, GIN and GCN, presented in Sect. 3).

If we consider the number of required matrix multiplications, compared to message passing GC networks, in our case it is possible to pre-compute the terms \({{{\mathcal {T}}}}({\mathbf {A}})^i {\mathbf {X}}\) before the start of training, making the computation of the convolution cheaper than message passing. In “Appendix H”, we report an example that shows the significant training-time improvement that can be gained with respect to message passing.
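A possible caching scheme is sketched below (our code, reusing the hypothetical build_R helper from Sect. 4): \({\mathbf {R}}_{k,{{{\mathcal {T}}}}}\) is computed once per graph before training, so each forward pass of the convolution reduces to a single dense product with \({\mathbf {W}}\).

```python
# A sketch of the pre-computation: `graphs` is assumed to be a list of (X, TA)
# pairs of dense tensors; build_R is the helper sketched in Sect. 4.
def precompute_R(graphs, k):
    return [build_R(X, TA, k) for (X, TA) in graphs]       # done once, offline

# During training, each convolution forward pass is just:
# H = R_cached[i] @ model.W
```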

5 Polynomial graph convolutional network (PGCN)

In this section, we present a neural architecture that exploits the PGC layer to perform graph classification tasks. Note that, differently from other GCN architectures, in our architecture (exploiting the PGC layer) the depth of the network is completely decoupled from the size k of the receptive field. The initial stage of the model consists of a first PGC layer with \(k=1\). The role of this first layer is to develop an initial node embedding that helps the subsequent PGC layer to fully exploit its power. In fact, in bioinformatics datasets where node labels \({\mathbf {X}}\) are one-hot encoded, all matrices \({\mathbf {X}}, {\mathbf {A}}{\mathbf {X}}, \ldots , {\mathbf {A}}^{k}{\mathbf {X}}\) are very sparse, which, as we observed in preliminary experiments, negatively affects learning. Table 9 in “Appendix I” compares the sparsity of the PGC representation using the original one-hot encoded labels against their embedding obtained with the first PGC layer. The analysis shows that, using this first layer, the network can work on significantly denser representations of the nodes. Note that this first stage of the model does not significantly bound the expressiveness of the PGC layer. A dense input for the PGC layer could have been obtained by using an embedding layer that is not a graph convolutional operator. However, this choice would have made it difficult to compare our results with other state-of-the-art models in Sect. 7, since the same input transformation could have been applied to the other models as well, making unclear the contribution of the PGC layer to the performance improvement. This is why we decided to use a PGC layer with \(k=1\) (equivalent to a GraphConv) to compute the node embedding, making the results fully comparable, since we use only graph convolutions in our PGCN. For the datasets that do not have node labels (like the social network datasets), the PGC layer with \(k=1\) creates a label for each node that will be used by the subsequent larger PGC layer to compute richer node representations. After this first PGC layer, a PGC layer as described in Eq. (20) of degree k is applied. In order to reduce the number of hyperparameters, we adopted the same number \(\frac{m}{k+1}\) of columns (i.e., hidden units) for the matrices \({\mathbf {W}}_{i,j}\), i.e. \({\mathbf {W}}_{i,j}\in {\mathbb {R}}^{s\times \frac{m}{k+1}}\). A graph-level representation \({\mathbf {s}}\in {\mathbb {R}}^{3m}\) based on the PGC layer output \({\mathbf {H}}\) is obtained by an aggregation layer that exploits three different aggregation strategies over the whole set of nodes \(V,\ j=1,\ldots ,m\):

$$\begin{aligned} s^{avg}_j&={avg}(\{{h}_v^{(j)}| v \in V \} ), \ s^{max}_j ={max}(\{{h}_v^{(j)}| v \in V \} ),\ s^{sum}_j ={sum}(\{{h}_v^{(j)}| v \in V \} ),\\ {\mathbf {s}}&=[ s^{avg}_1,s^{max}_1,s^{sum}_1, \dots ,s^{avg}_{m},s^{max}_{m},s^{sum}_{m} ]^{\top }. \end{aligned}$$

The readout part of the model is composed of q dense feed-forward layers, where we consider q and the number of neurons per layer as hyper-parameters. Each of these layers uses the ReLU activation function, and is defined as \({\mathbf {y}}_j = {ReLU}({\mathbf {W}}^{readout}_{j} {\mathbf {y}}_{j-1} + {\mathbf {b}}^{readout}),\; j=1,\ldots ,q, \;\) where \({\mathbf {y}}_0={\mathbf {s}}\). Finally, the output layer of the PGCN for a c-class classification task is defined as:

$$\begin{aligned} {\mathbf {o}} = {LogSoftmax}({\mathbf {W}}^{out} {\mathbf {y}}_q + {\mathbf {b}}^{out}). \end{aligned}$$

For a complete discussion about the reasons why we implement the readout network by an MLP, please refer to “Appendix J”.
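A minimal sketch of the aggregation and readout described above, for a single input graph, is the following (our naming; the hidden size and the number of layers are illustrative placeholders).

```python
# A sketch of the PGCN readout: the three aggregations interleaved as in the
# definition of s, followed by q dense ReLU layers and a LogSoftmax output.
import torch
import torch.nn.functional as F

def pgcn_readout(H: torch.Tensor, mlp: torch.nn.Module, out_layer: torch.nn.Linear):
    """H: n x m node representations of one graph produced by the PGC layer."""
    s_avg, s_max, s_sum = H.mean(dim=0), H.max(dim=0).values, H.sum(dim=0)
    s = torch.stack([s_avg, s_max, s_sum], dim=1).reshape(-1)   # shape: 3m
    return F.log_softmax(out_layer(mlp(s)), dim=-1)

# Hypothetical readout with q = 2 hidden layers:
# mlp = torch.nn.Sequential(torch.nn.Linear(3 * m, hidden), torch.nn.ReLU(),
#                           torch.nn.Linear(hidden, hidden), torch.nn.ReLU())
# out_layer = torch.nn.Linear(hidden, num_classes)
```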

6 Multi-scale GCN architectures in literature

Some recent works in the literature exploit the idea of extending graph convolution layers to increase the receptive field size. In general, the majority of these models, which concatenate polynomial powers of the adjacency matrix A, are designed to perform node classification, while the proposed PGCN is developed to perform graph classification. In this regard, we want to point out that the novelty introduced in this paper is not limited to a novel GC layer: the proposed PGCN is a complete architecture to perform graph classification.

Atwood and Towsley (2016) proposed a method that exploits the power series of the probability transition matrix, which is multiplied (using the Hadamard product) by the inputs. The method differs from the PGCN both in terms of how the activation is computed, and because the activations computed for the different exponentiations are summed instead of being concatenated.

Similarly, the model in Defferrard et al. (2016) exploits the Chebyshev polynomials and, differently from PGCN, sums them over k. This architectural choice makes that method less general than the PGCN. Indeed, as shown in Sect. 4.1, the model proposed in Defferrard et al. (2016) is an instance of the PGC.

In Xu et al. (2018), the authors proposed to modify the common aggregation layer in such a way that, for each node, the model aggregates all the intermediate representations computed in the previous GC layers. Differently from PGCN, this model exploits the message passing method, introducing a bias in the flow of the topological information. Note that, as proven in Theorem 2, a PGC layer of degree k is not equivalent to concatenating the output of k stacked GC layers, even though the PGC layer can also implement this particular architecture.

Another interesting approach is proposed in Tran et al. (2018), where the authors consider larger receptive fields compared to standard graph convolutions. However, they focus on a single convolution definition (using just the adjacency matrix) and consider shortest paths (differently from PGCN, which exploits matrix exponentiations, i.e. random walks). In terms of expressiveness, it is difficult to compare methods that exploit matrix exponentiations with methods based on shortest paths. However, it is interesting to notice that, thanks to the very general structure of the PGC layer, it is easy to modify the PGC definition in order to use shortest paths instead of the exponentiations of the transformed adjacency matrix. We plan to explore this option as a future development of the PGCN.

Wu et al. (2019) introduce a simplification of the graph convolution operator, dubbed simple graph convolution (SGC). The proposed model is based on the idea that the nonlinearity introduced by GCNs may not be essential, and basically the authors propose to stack several linear GC operators. In Theorem 2 we prove that stacking k GC layers is less expressive than using a single PGC layer of degree k. Therefore, we can conclude that the PGC layer is a generalization of the SGC.

In Liao et al. (2019) the authors construct a deep graph convolutional network, exploiting particular localized polynomial filters based on the Lanczos algorithm, which leverages multi-scale information. This convolution can be easily implemented by a PGC layer. In Chen et al. (2019) the authors propose to replace the neighbor aggregation function with graph augmented features. These graph augmented features combine node degree features and multi-scale graph propagated features. Basically, the proposed model concatenates the node degree with the power series of the normalized adjacency matrix. Note that the graph augmented features differ from \({\mathbf {R}}_{k,{{{\mathcal {T}}}}}\), used in the PGC layer. Another difference with respect to the PGCN resides in the subsequent part of the model. Indeed, instead of projecting the multi-scale feature layer using a structured weight matrix, the model proposed in Chen et al. (2019) aggregates the graph augmented features of each vertex and projects each of these subsets using an MLP. The model readout then sums the obtained results over all vertices and projects them using another MLP.

Luan et al. (2019) introduced two deep GCNs that rely on Krylov blocks. The first one exploits a GC layer, named snowball, that concatenates multi-scale features incrementally, resulting in a densely-connected graph network. The architecture stacks several layers and exploits nonlinear activation functions. Both these aspects make the gradient flow more complex compared to the PGCN. The second model, called Truncated Krylov, concatenates multi-scale features in each layer. In this model, differently from PGCN, the weight matrix of each layer has no structure, thus topological features from all levels are mixed together.

Another method that introduces an alternative to the message passing mechanism is proposed in Klicpera et al. (2019). Differently from PGCN, which exploits the concatenation of the powers of the diffusion operator to propagate the message through the graph topology, Klicpera et al. proposed a graph convolution that exploits Personalized PageRank as the propagation scheme. Let \(f(\cdot )\) define a 2-layer feedforward neural network. The PPNP layer is defined as: \({\mathbf {H}}= \alpha \left( {\mathbf {I}}_n - (1 - \alpha ) \tilde{{\mathbf {A}}}\right) ^{-1} f({\mathbf {X}})\), where \(\tilde{{\mathbf {A}}}={\mathbf {A}}+{\mathbf {I}}_n\). Such a filter preserves locality due to the properties of Personalized PageRank.

The same paper proposed an approximation, derived by a truncated power iteration, avoiding the expensive computation of the matrix inversion, referred to as APPNP. It is implemented as a multi-layer network where the \((l+1)\)-th layer is defined as \({\mathbf {H}}^{(l+1)}= (1 - \alpha ) \tilde{{\mathbf {S}}} {\mathbf {H}}^{(l)} + \alpha {\mathbf {H}}^{(0)}\), where \({\mathbf {H}}^{(0)} = f({\mathbf {X}})\) and \(\tilde{{\mathbf {S}}}\) is the renormalized adjacency matrix adopted in GCN, i.e. \(\tilde{{\mathbf {S}}} = \tilde{ {\mathbf {D}}}^{-\frac{1}{2}} \tilde{{\mathbf {A}}}\tilde{{\mathbf {D}}}^{-\frac{1}{2}}\).

PPNP and APPNP differ significantly from PGCN, since they use a multi-layer (non-convolutional) architecture in order to exploit the different powers of the diffusion operator. Another important difference between PGCN and PPNP/APPNP is that the latter are specifically developed to solve the node classification task.

Abu-El-Haija et al. (2019) proposed a multilayer architecture that exploits the MixHop graph convolution layer. Each layer of the model mixes a subset (managed as a hyper-parameter) of the powers of the adjacency matrix, by multiplying them by the embedding computed in the previous layer. Finally, each layer concatenates the representations obtained for the considered powers of the diffusion operator. Therefore, differently from PGC, the MixHop layer considers a subset of the powers of the adjacency matrix and, moreover, it non-linearly projects the representation obtained for each considered power. Similarly to the previously discussed multi-scale architectures, the MixHop model is also developed to perform the node classification task.

Rossi et al. (2020) proposed an alternative method, named SIGN, to scale GNNs to very large graphs. This method uses as a building block the set of exponentiations of linear diffusion operators. In this building block, every exponentiation of the diffusion operator is linearly projected by a learnable matrix. Moreover, differently from the PGC layer, a nonlinear function is applied to the concatenation of the diffusion operators, making the gradient flow more complex compared to the PGCN.

Very recently, Liu et al. (2020) proposed a model dubbed deep adaptive graph neural network, to learn node representations by adaptively incorporating information from large receptive fields. Differently from PGCN, the model first exploits an MLP network for node feature transformation. Then it constructs a multi-scale representation leveraging the transformed node features and the exponentiations of the adjacency matrix. This representation is obtained by stacking the various adjacency matrix exponentiations (thus obtaining a 3-dimensional tensor). Similarly to Luan et al. (2019), also in this case the model projects the obtained multi-scale representation using a weight matrix that has no structure, so that the topological features from all levels are mixed together. Moreover, this projection also uses (trainable) retainment scores. These scores measure how much information of the corresponding representations derived by different propagation layers should be retained to generate the final representation for each node, in order to adaptively balance the information from local and global neighborhoods. Obviously, this makes the gradient flow more complex compared to the PGCN, and also impacts the computational complexity.

7 Experimental setup and results

In this section, we introduce our model setup, the adopted datasets, the baseline models, and the hyper-parameter selection strategy. We then report and discuss the results obtained by the PGCN. For implementation details, please refer to “Appendix F”.

7.1 Datasets

We empirically validated the proposed PGCN on five commonly adopted graph classification benchmarks modeling bioinformatics problems: PTC (Helma et al. 2001), NCI1 (Wale et al. 2008), PROTEINS (Borgwardt et al. 2005), D&D (Dobson and Doig 2003) and ENZYMES (Borgwardt et al. 2005). Moreover, we also evaluated the PGCN on three large social network datasets: COLLAB, IMDB-B, IMDB-M (Yanardag and Vishwanathan 2015). We report more details in “Appendix G”.

7.2 Baselines and hyper-parameter selection

We compare PGCN against several GNN architectures that achieved state-of-the-art results on the adopted datasets. Specifically, we considered PSCN (Niepert et al. 2016), the Funnel GCNN (FGCNN) model (Navarin et al. 2020), DGCNN (Zhang et al. 2018), GIN (Xu et al. 2019), DIFFPOOL (Ying et al. 2018) and GraphSAGE (Hamilton et al. 2017). Note that these graph classification models exploit the convolutions presented in Sect. 3. We also report the results of a structure-agnostic baseline model from Errica et al. (2020). More precisely, in Errica et al. (2020) the authors adopted two different baselines, one for the chemical datasets and one for the social datasets. For the chemical datasets, the model counts the occurrences of atom types in the graph by summing the features of all nodes together, and then applies a single-layer MLP. For the social datasets, the baseline applies an MLP to the input node features, then uses a global sum pooling operator, and then another MLP to perform classification.

The results were obtained by performing five runs of ten-fold cross-validation. The hyper-parameters of the model (number of hidden units, learning rate, weight decay, k, q) were selected using a grid search, where the explored sets of values were changed based on the considered dataset. As validation methodology, we decided to follow the one proposed in Errica et al. (2020), which, in our opinion, is the fairest. Other details about validation are reported in “Appendix K”.

7.3 Results and discussion

Table 1 Accuracy comparison between PGCN and several state-of-the-art models on the graph classification task.

The results reported in Table 1 show that the PGCN achieves higher results than the competing methods on all but one of the considered datasets. In particular, on NCI1 and ENZYMES the proposed method outperforms state-of-the-art results. In fact, in both cases, the performances of PGCN and the best competing method are more than one standard deviation apart. Also on the PTC, D&D, PROTEINS, IMDB-B and IMDB-M datasets, PGCN shows a slight improvement over the results of the FGCNN and DGCNN models. Furthermore, on the bioinformatics datasets PGCN achieves a significantly lower standard deviation (evaluated over the 5 runs of 10-fold cross-validation). On the COLLAB dataset, PGCN obtained the second highest result among the considered state-of-the-art methods. Note, however, that the difference with respect to the first one (GIN) is within the standard deviation.

Significance of our results To understand whether the improvements reported in Table 1 are significant or can be attributed to random chance, we conducted the two-tailed Wilcoxon signed-rank test between our proposed PGCN and the competing methods. This test considers all the results for the different datasets at the same time. According to this test, PGCN performs significantly better (p-value \(< 0.05\)) than PSCN, \(\hbox {DGCNN}^3\), GIN, DIFFPOOL and GraphSAGE. As for FGCNN and \(\hbox {DGCNN}^2\), four datasets are not enough to conduct the Wilcoxon test, see Japkowicz and Shah (2011), Table A.5.

7.4 Experimental results omitted in the results comparison

The results reported in Xu et al. (2019), Chen et al. (2019), Ying et al. (2018) are not considered in our comparison since the model selection strategy is different from the one we adopted, and this makes the results not comparable.

The importance of the validation strategy is discussed in Errica et al. (2020), where results of a fair comparison among the considered baseline models are reported. For the sake of completeness, we also report (and compare) in Table 2 the results obtained by evaluating the PGCN method with the validation policy used in Xu et al. (2019), Chen et al. (2019), Ying et al. (2018).

Indeed, the results reported in Xu et al. (2019) cannot be compared with the other results reported in Table 1, because the authors state “The hyper-parameters we tune for each dataset are [...] the number of epochs, i.e., a single epoch with the best cross-validation accuracy averaged over the 10 folds was selected.”. Similarly, for the results reported in Chen et al. (2019) for the GCN and GFN models, the authors state “We run the model for 100 epochs, and select the epoch in the same way as Xu et al. (2019), i.e., a single epoch with the best cross-validation accuracy averaged over the ten folds is selected”. In both cases, the model selection strategy is clearly biased and different from the one we adopted, which makes the results not comparable.

Moreover, in Xu et al. (2019) the node descriptors are augmented with structural features: in the GIN experiments the authors add a one-hot representation of the node degree. We decided to use a common setting for the chemical domain, where the nodes are labeled with a one-hot encoding of their atom type. The only exception is ENZYMES, where it is common to use 18 additional available features. A similar problem arises in Ying et al. (2018), since the authors add the degree and the clustering coefficient to each node feature vector. Table 2 shows that the PGCN outperforms the methods proposed in the literature on almost all datasets.

Table 2 Accuracy comparison between PGCN and the models in Xu et al. (2019), Chen et al. (2019), Ying et al. (2018), following the validation policy adopted in those works.

7.5 Empirical comparison versus multi-scale GCNs

In this section, we empirically compare the PGCN with some of the methods that exploit the multi-scale approach. As we highlighted in Sect. 6, most of these models are developed to perform node classification tasks and have very specific implementations. This makes their extension to the graph classification task very complex and, in some cases, would lead to modifications defining an almost entirely new model. For these reasons, we decided to implement only the models where the multi-scale layer, or at least the proposed mechanism, is already implemented in PyTorch Geometric. Thanks to this, we were able to adapt these models by inserting, after the multi-scale layer representation, the same readout structure used by the PGCN. The models for which we developed and implemented a graph classification version (never proposed in the literature) are four: SGC (Wu et al. 2019), the Chebyshev convolutional network (Defferrard et al. 2016), JK-Net (Xu et al. 2018), and SIGN (Rossi et al. 2020). For JK-Net, the authors propose a novel methodology to aggregate the different convolutional layer representations; therefore, in order to perform a fair comparison, we consider the parameter k as the number of convolutional layers of the model. Moreover, as graph convolutional operator we exploited the GCN (Kipf and Welling 2017), as done by the authors in the original paper. For all these models we performed a full validation via grid search, following the same methodology used to validate the PGCN results and using the same hyper-parameter grid (reported in “Appendix K”). Performing the validation via grid search for all the considered datasets required running more than 55,000 experiments. Note that the model proposed by Klicpera et al. (2019) is also available as a propagation mechanism in the PyTorch Geometric library, but it cannot be applied to the graph classification task since PPNP is specifically defined for node classification. Finally, we would like to point out that the only models cited in Sect. 6 that are already defined to perform graph classification are the ones proposed in Chen et al. (2019), which are considered in the comparison reported in Table 2.

The obtained results are reported in Table 3, and they show that the PGCN model achieves better results than the other multi-scale architectures on all the considered datasets, except PROTEINS and COLLAB. Notice that the difference between the accuracy reached by PGCN and the best result achieved on the PROTEINS and COLLAB datasets by the Chebyshev convolutional network is significantly lower than the standard deviation obtained by the best method. On the bioinformatics datasets, the PGCN consistently shows a lower standard deviation in comparison with the other multi-scale models.

Similarly to what was done in Sect. 7.3, in order to verify that the improvements reported in Table 3 are significant, and not due to chance, we conducted the two-tailed Wilcoxon signed-rank test between our proposed PGCN and the multi-scale competing methods. Also in this case, the PGCN improvement compared to all competing methods is statistically significant (significance level \(<0.05\)).

Table 3 PGCN accuracy comparison versus other multi-scale node classification methods adapted to perform graph classification tasks.

8 Model analysis

In this section, we analyze some crucial aspects of the proposed model. Specifically, we study the impact of the receptive field size and the computational demand of the model, comparing the time performance of the proposed model against FGCNN (Navarin et al. 2020), which shares a similar architecture.

8.1 Impact of receptive field size on PGCN

Most of the GCN architectures proposed in the literature generally stack 4 or fewer GC layers. The proposed PGC layer allows us to represent a linear version of these architectures by using a single layer with an even higher depth (\(k+1\)), without incurring problems related to the flow of topological information. Different values of k have been tested to study how much the capability of the model to represent increased topological information helps to obtain better results. The results of these experiments are reported in Table 4. The accuracy results in this table refer to the validation sets, since the choice of k is part of the model selection procedure. We decided to consider a range of k values between 2 and 5 for the bioinformatics datasets, and between 2 and 8 for the social network datasets. The results show that it is crucial to select an appropriate value for k. Several factors influence how much depth is needed. It is important to take into account that the various datasets used for the experiments refer to different tasks. The quantity and type of topological information required (or useful) to solve the task highly influence the choice of k. Moreover, the input dimensions and the number of graphs contained in a dataset also play an important role. In fact, using higher values of k increases the number of columns of the \({\mathbf {R}}_{k,{{{\mathcal {T}}}}}\) matrix (and therefore the number of parameters embedded in \({\mathbf {W}}\)), making the training of the model more difficult. It is interesting to notice that in many cases our method exploits a larger receptive field (i.e. a higher degree) compared to the competing models. Note that the datasets where better results are obtained with \(k=2\) (PTC and PROTEINS) contain a limited number of training samples; thus, deeper models tend to overfit, arguably due to the limited amount of training data.

Table 4 PGCN accuracy comparison on the validation set of the datasets for different values of k.
Table 5 Time in seconds to perform a single training epoch (2nd and 3rd columns) and to perform classification (4th and 5th columns), using PGCN and FGCNN (Navarin et al. 2020), respectively.
Fig. 1 PGCN and FGCNN training curves for the D&D and NCI1 datasets

8.2 Speed of convergence

Here, we discuss the computational demand of the proposed PGCN compared with FGCNN (Navarin et al. 2020). We decided to compare these two models since they present a similar readout layer; therefore, the comparison best highlights how the different methodologies manage the number of considered k-hops from the point of view of performance. In Table 5, we report the average time (over the ten folds) to perform a single epoch of training and to perform the classification with both methods.

In the evaluation we considered similar architectures, using 3 layers for the FGCNN and \(k=2\) for PGCN. The other hyper-parameters were set with the aim of obtaining almost the same number of parameters in both models, to ensure a fair comparison. The batch sizes used for this evaluation are the same selected by the PGCN model selection. The results show a significant advantage in using a PGC layer instead of the message passing based method exploited by FGCNN.

Concerning the speed of convergence of the two models, in Fig. 1 we report the training curves for two representative datasets (D&D and NCI1). On the x-axis we report the computational time in seconds, while on the y-axis we report the loss value. Both curves end after 200 training epochs. From the curves it can be seen that PGCN converges faster than, or at a similar pace to, FGCNN.

8.3 \({\mathbf {W}}\) structure

PGC has been defined with a \({\mathbf {W}}\) matrix that is a block upper triangular matrix. This structure is induced by the concatenation of the output of \(k+1\) PGC convolutions of degree ranging from 0 up to k. Indeed, using a matrix with this precise structure ensures that we obtain multi-resolution representations for each node and its neighborhoods. In fact, \(PGConv_{i,\mathcal{T},m}({\mathbf {X}},{\mathbf {A}}), \; i \in [0, \dots , k]\) contains information only about random walks of length exactly equal to i. Thanks to the structure of \({\mathbf {W}}\), different hop components are progressively mixed in a differentiated way. This is not the only advantage obtained by using this particular structure: having a block triangular matrix increases the probability of keeping a full-rank matrix during training, and thus of fully exploiting the expressivity of the corresponding linear transformation. These two features can also be obtained by imposing even stricter constraints on the structure of W, further reducing the number of used parameters. For example, if we use a block diagonal matrix, the node embeddings computed for each i value will not be mixed when computing the corresponding hidden representations \({\mathbf {H}}\). An even more constrained structure for W can be obtained by using a diagonal matrix. In this case, not only will the embeddings for different i values not be mixed, but also the single node features will not be mixed with each other when computing \({\mathbf {H}}\). Note that, for both these matrix structures, the number of trainable parameters decreases significantly with respect to the block upper triangular case.
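The three structures discussed here can be expressed as boolean masks applied to a dense parameter matrix, as in the following sketch (our code; it assumes, purely for illustration, that all blocks share the same number of columns).

```python
# A sketch of the W structures compared in this section, expressed as masks:
# 'upper' (block upper triangular), 'block_diag', and 'diag'. Illustrative only.
import torch

def structure_mask(s: int, k: int, m_block: int, kind: str) -> torch.Tensor:
    """Mask of shape s(k+1) x m_block(k+1)."""
    mask = torch.zeros(s * (k + 1), m_block * (k + 1))
    for i in range(k + 1):
        for j in range(k + 1):
            rows = slice(i * s, (i + 1) * s)
            cols = slice(j * m_block, (j + 1) * m_block)
            if kind == 'upper' and i <= j:
                mask[rows, cols] = 1.0
            elif kind == 'block_diag' and i == j:
                mask[rows, cols] = 1.0
            elif kind == 'diag' and i == j:
                mask[rows, cols] = torch.eye(s, m_block)   # per-feature weights only
    return mask

# W_effective = W_dense * structure_mask(s, k, m_block, 'block_diag')
```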

In Fig. 2, we report the (average) distribution of the singular values of the weights \({\mathbf {W}}\), along with heatmaps that show the structure of the matrix and the (average) values of the weights. The considered weight matrices are the result of the training phase on the NCI1 dataset using the same hyper-parameters. The singular values are computed for each W matrix learned in each of the 10 folds used to compute the reported accuracy. In the figure, we report the average (and variance) of each singular value over all folds. The obtained curves show that, using a sparser structure, the number of singular values close to 0 decreases. As we pointed out, having a W with a block diagonal structure, or even a diagonal structure, limits the expressiveness of the computed transformation, but on the other hand it allows for a faster training phase (due to the lower number of trainable parameters). To assess the pros and cons of using these more constrained structures, we report in Table 6 the accuracy obtained by the same model using each of the proposed \({\mathbf {W}}\) structures on the NCI1 dataset. The results show that with a sparser structure the performance drops, but not in a substantial way, still maintaining a good classification accuracy.

Fig. 2 Top: singular value distributions (average with variance) of the weight matrices \({\mathbf {W}}\) for different matrix structures (block upper triangular, block diagonal, diagonal) on the NCI1 dataset (10 folds). Bottom: heatmaps of the corresponding \({\mathbf {W}}\) matrices (average over the 10 folds)

Table 6 Accuracy computed on the NCI1 test set, varying the structure of the W matrix.

9 Discussion about further PGC advantages

In this section, we briefly discuss some additional advantages of the proposed PGC architecture. In a nutshell, there are two main additional advantages: (i) the architecture is amenable to the application of techniques to improve explainability; (ii) the singular values of \({\mathbf {W}}\) can ease the model selection procedure. Concerning explainability, if a diagonal \({\mathbf {W}}\) is used, any sensitivity/relevance analysis applied to the final network with respect to the input will allow us to understand which specific node feature is important at which specific value of k. This information can be projected back onto a specific input graph, giving the possibility to explain which (node and structural) input features contribute most to the output. If a block diagonal \({\mathbf {W}}\) is used, features are mixed, but the most important values of k (structural features) for the output task can still be easily identified. Concerning model selection, the singular values of \({\mathbf {W}}\) provide guidance on the sizing of the network. In fact, independently of the imposed structure, very low singular values of \({\mathbf {W}}\) are an indication that smaller sizes may be preferred, if forcing those singular values to 0 (easily achievable by a singular value decomposition of \({\mathbf {W}}\)) returns the same (or better) performance. Of course, using a (block) diagonal matrix makes this process much easier.
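A minimal sketch of this model-selection heuristic (our code) truncates the near-zero singular values of \({\mathbf {W}}\) via SVD, so that the resulting low-rank network can be compared with the original one.

```python
# A sketch of SVD-based truncation of W: singular values below a relative
# tolerance are dropped; if performance is unchanged, a smaller m is likely
# sufficient. Illustrative only.
import torch

def truncate_W(W: torch.Tensor, tol: float = 1e-3) -> torch.Tensor:
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    keep = S > tol * S.max()                        # drop near-zero singular values
    return (U[:, keep] * S[keep]) @ Vh[keep, :]     # low-rank reconstruction of W
```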

10 Conclusions and future work

In this paper, we analyze some of the most common convolution operators, evaluating their expressiveness. Our study shows that their linear composition can be defined as an instance of a more general polynomial graph convolution operator with higher expressiveness. We defined an architecture exploiting a single PGC layer to generate a decoupled representation for each neighbor node at a different topological distance. This strategy allows us to avoid the bias on the flow of topological information introduced by stacking multiple graph convolution layers. We empirically validated the proposed Polynomial Graph Convolutional Network on eight commonly adopted graph classification benchmarks. The results show that the proposed model outperforms competing methods on almost all the considered datasets, while also showing a more stable behavior.

In the future, in addition to exploring the research directions outlined in Sect. 9, we plan to study the possibility of introducing an attention mechanism by learning a transformation \({\mathcal {T}}\) that can adapt to the input. Furthermore, we will explore whether adopting our PGC operator as a large random projection can allow the development of a novel model for learning on graph domains.