1 Introduction

Neural Networks for Graphs (GNNs), while dating back more than 20 years [27], have recently gained popularity due to their good results in tasks such as semi-supervised node classification [14], link prediction [13], graph classification [22], and graph generation [18]. The main component that makes the application of neural networks to graph data possible is the Graph Convolution (GC), for which several definitions have been proposed in the literature. The majority of GC proposals share the basic principle of generating a (fixed-size) representation of a node by considering its local neighborhood.

When considering graph-level prediction tasks, however, these topologically enriched node-level representations need to be aggregated in order to obtain a single (fixed-size) representation of the graph. This aggregation component is crucial since it has to transform a variable number of node-level representations into a single graph-level one. Moreover, an effective and efficient graph-level representation should be, as much as possible, invariant to different isomorphic representations of the input graph, thus letting the learning procedure focus only on the property prediction task, without having to worry about the way the input graph is represented. An approach commonly adopted in many graph neural network architectures proposed in the literature is to consider simple aggregation schemes such as the mean, the element-wise maximum, or the sum. However, recent results [21, 32] show that using such simple aggregations inevitably results in a loss of information due to the mixing of numerical values they introduce, which may hurt the overall predictive performance of the GNN.

A much better approach, from a conceptual point of view, is to consider all the topologically enriched node representations of a graph as a set, and the aggregation function to be learned as a function defined on such sets. DeepSets [32] constitute a recently proposed approach to design neural networks that take sets as input. Compared to other aggregation schemes, the DeepSets scheme is maximally expressive since, under certain assumptions, it can be proved to be a universal approximator for functions over sets (see Sect. 2.3). A DeepSet projects the elements of the input set into a high-dimensional space via a learned \(\phi (\cdot )\) function, usually implemented as a multi-layer perceptron [21]. It then aggregates the vectorial element representations by summing them up into a single vector representing the set, and finally it applies the readout, i.e., the \(\rho (\cdot )\) function (another MLP), to map the set-level representation to the output of the task at hand. Navarin et al. [21] propose a graph aggregation scheme based on DeepSets that implements the \(\phi (\cdot )\) function as a multi-layer perceptron. Motivated by the theoretical properties that the \(\phi (\cdot )\) function should possess, in this paper we propose to implement \(\phi (\cdot )\) by exploiting self-organizing maps (SOMs) to map the node representations into the space defined by the activations of the SOM neurons. The resulting representations encode information about the similarity between the various inputs in an unsupervised way: similar input structures are mapped to similar output representations. Using a fully unsupervised mapping for the \(\phi (\cdot )\) function may, however, lead to a loss of task-related information. We thus propose to make the \(\phi (\cdot )\) mapping supervised by stacking, after the SOM, a layer that can be trained via supervised learning. Since we are dealing with graphs, instead of simply using an MLP we propose to stack a Graph Convolution layer after the SOM, which allows topological information to be better incorporated into the mapping. We can then apply the aggregation as prescribed by DeepSets. Finally, we implement the readout (the \(\rho (\cdot )\) function) as an MLP. We show the effectiveness of our proposal on several commonly adopted graph benchmark datasets.

The paper is organized as follows. In Sect. 2 we introduce the necessary background concepts. In Sect. 3 we propose our SOM-based aggregation. In Sect. 4 we present our experimental results, and in Sect. 5 we analyze some properties of the proposed DeepSets-based aggregation. Section 6 concludes the paper.

2 Background

In the following, we use italic letters to refer to variables, bold lowercase letters to refer to vectors, and bold uppercase letters to refer to matrices. The elements of a matrix \(\mathbf{A}\) are referred to as \(a_{ij}\) (and similarly for vectors). We use uppercase letters to refer to sets or tuples.

Let \(G=(V,E,\mathbf{X})\) be a graph, where \(V=\{v_0, \ldots ,v_{n-1}\}\) denotes the set of vertices (or nodes) of the graph, \(E \subseteq V \times V\) is the set of edges, and \(\mathbf{X} \in \mathbb {R}^{n\times s}\) is a multivariate signal on the graph nodes, with the i-th row representing the attributes of \(v_i\). We define \(\mathbf{A} \in \mathbb {R}^{n \times n}\) as the adjacency matrix of the graph, with elements \(a_{ij}=1 \iff (v_i,v_j)\in E\). With \(\mathcal {N}(v)\) we denote the set of nodes adjacent to node v. Let also \(\mathbf{D} \in \mathbb {R}^{n \times n}\) be the diagonal degree matrix where \(d_{ii}=\sum _j a_{i j}\), and \(\mathbf{L}\) the normalized graph Laplacian defined by \(\mathbf{L} = \mathbf{I}-\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\), where \(\mathbf{I}\) is the identity matrix.
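To make the notation concrete, the following minimal NumPy sketch builds \(\mathbf{A}\), \(\mathbf{D}\) and \(\mathbf{L}\) for a hypothetical 3-node path graph (the example graph is ours, chosen only for illustration).

```python
import numpy as np

# Path graph v0 - v1 - v2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                 # adjacency matrix
D = np.diag(A.sum(axis=1))                   # diagonal degree matrix
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized graph Laplacian
```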

2.1 Neural Networks for Graphs

The first definition of a neural network for structured data, including graphs, was proposed by Sperduti and Starita in 1997 [27]. It was later refined by Micheli [19] and Scarselli et al. [24]. The core idea is to define a neural architecture that is modeled according to the graph topology. Thanks to weight sharing, the same set of neurons is applied to each vertex in the graph, and it computes its output based on the representation of the vertex and of its neighbors. As usual, the function computed by each layer is parametric. Recently [5], this approach has been referred to as graph convolution. After a certain number of graph convolution layers, the node-level representations are merged by an aggregation operator, obtaining a fixed-size graph-level representation. Finally, the readout layer transforms this representation into the output of the task.

In more detail, a general Graph Neural Network model is built according to the following equations. First, d graph convolution layers are stacked:

$$\begin{aligned} \mathbf{h}_v^{\mathrm{GC}(i)}=f\left( \hbox{graphconv}\left( \mathbf{h}^{\mathrm{GC}(i-1)}_v, \left\{ \mathbf{h}^{\mathrm{GC}(i-1)}_u | u \in \mathcal {N}(v)\right\} \right) \right) , \end{aligned}$$
(1)

where \(f(\cdot )\) is an element-wise non-linear activation function, \(\hbox{graphconv}(\cdot ,\cdot )\) is a graph convolution operator, \(\mathbf{h}^{\mathrm{GC}(i)}_v\) is the representation of node v at the i-th graph convolution layer, \(1 \le i \le d\), and \(\mathbf{h}^{\mathrm{GC}(0)}_v = \mathbf{x}_v\) (i.e., the row of \(\mathbf{X}\) corresponding to v). Then, an aggregation function is applied:

$$\begin{aligned} \mathbf{h}^{S}=\hbox{aggr}\left( \{\mathbf{h}^{\mathrm{GC}(i)}_v | v \in V_{G}, 1 \le i \le d\}\right) , \end{aligned}$$
(2)

where \(\hbox{aggr}(\cdot )\) is the aggregator function. Note that the aggregation may depend on all the hidden representations computed by the different GC layers, and not just on the last one. \(\mathbf{h}^{S}\) is the fixed-size graph-level representation. Then, the \(\hbox{readout}(\cdot )\) (implemented as a multi-layer perceptron) applies a non-linear transformation to \(\mathbf{h}^{S}\). Finally, we apply the output layer (e.g., a LogSoftMax for a classification problem):

$$\begin{aligned} \mathbf{o}=\hbox{LogSoftMax}(\hbox{readout}(\mathbf{h}^{S})). \end{aligned}$$
(3)

As for the graph convolution, in this paper we mainly consider a particular operator inspired by the Weisfeiler-Lehman graph invariant, proposed by Morris et al. [20]. This GC, named GraphConv, is defined as follows:

$$\begin{aligned} \mathbf{H}^{(i+1)}=\mathbf{H}^{(i)}{\bar{\mathbf{W}}}^{(i)}+\mathbf{A}\mathbf{H}^{(i)}{\hat{\mathbf{W}}}^{(i)}, \end{aligned}$$
(4)

where \(\mathbf{H}^{(0)}=\mathbf{X}\), and \({\bar{\mathbf{W}}}^{(i)}\) and \({\hat{\mathbf{W}}}^{(i)}\) are two weight matrices.
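As an illustration, the following is a minimal PyTorch sketch of the GraphConv operator of Eq. (4) with dense adjacency matrices; the class and attribute names are ours and do not refer to any specific library implementation.

```python
import torch
import torch.nn as nn

class DenseGraphConv(nn.Module):
    """Sketch of Eq. (4): H' = H W_bar + A H W_hat (dense adjacency)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_bar = nn.Linear(in_dim, out_dim, bias=False)   # \bar{W}
        self.w_hat = nn.Linear(in_dim, out_dim, bias=False)   # \hat{W}

    def forward(self, H, A):
        # H: (n, in_dim) node features, A: (n, n) adjacency matrix
        return self.w_bar(H) + self.w_hat(A @ H)

# Toy usage on a 2-node graph with a single edge
A = torch.tensor([[0., 1.], [1., 0.]])
X = torch.randn(2, 8)
H1 = torch.relu(DenseGraphConv(8, 16)(X, A))   # f is the ReLU, as in Eq. (1)
```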

A more complete discussion of the GC operators is reported in “Appendix A”. In the following, we present the different definitions of node aggregation proposed in the literature.

2.1.1 Aggregation of node representations

After stacking a number of graph convolution layers, an aggregation operator maps the set of representations associated with the single vertices into a graph-level representation. Different approaches to implement this aggregation operator are possible.

Linear operators The simplest aggregation operators adopted in the literature are linear, namely the average and the sum of vertex representations. NN4G [19] computes, for each graph, the average graph vertex representation for each hidden layer, and concatenates them. Other approaches consider only the last graph convolution layer to compute such an average [1]. In [9], multi-layer perceptrons are applied to transform node representations before a sum aggregator is applied.

Non-linear operators SortPooling is a non-linear pooling operator [33] used in conjunction with concatenation to obtain an aggregation operator. The idea is to select a pre-determined number of vertex embeddings using a sorting function, and to concatenate them, obtaining a graph-level representation of fixed size. Notice, however, that this representation ignores some of the nodes of the graph.

Another approach consists in using a set2set model, a simplified Neural Turing Machine, to handle sets as input to the readout function [9]. The model is capable of mapping sets to other sets in output, and it is thus more powerful than what is required for classification or regression tasks. This makes this instantiation hard to train, introducing unneeded complexity into the model. Finally, it has been shown that DeepSets, a general formulation of a universal approximator of functions over sets [32], can be successfully adopted as an aggregation operator over graph nodes [21]. More details are reported in Sect. 2.3.

2.2 Self-organizing map

The main goal of the Self-Organizing Map (SOM) algorithm is to transform an incoming signal pattern of arbitrary dimension into a two- or three-dimensional discrete map, and to perform this transformation adaptively, in a topologically ordered fashion. The neurons of a self-organizing map are distributed over a lattice that is usually two- or three-dimensional, and are equipped with synaptic weight vectors \(\mathbf{s}_i, \; i \in [1 \ldots p]\), where p is the number of neurons of the SOM. These neurons compete among themselves to be activated, with the result that only one neuron, the winner of the competition dubbed best matching unit (BMU), is selected as the prototype for each input pattern. The weights of the BMU are moved closer to the input vector, as are the weights of its neighbours, although to a lesser degree, with the aim of preserving, at the lattice level, the topological relationships of the input space. Different variants of the SOM model can be considered; the one we use is described in detail in Sect. 3.1.
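For concreteness, the following is a minimal sketch of one classic Kohonen-style update step; the Gaussian neighbourhood and the function name are illustrative choices, not the exact SOM variant adopted in this paper (which is detailed in Sect. 3.1).

```python
import numpy as np

def kohonen_step(weights, x, lr, sigma):
    """One SOM update: weights has shape (p1, p2, dim), x is one input vector.
    The BMU and, to a lesser degree, its lattice neighbours move towards x."""
    p1, p2, _ = weights.shape
    d = np.linalg.norm(weights - x, axis=2)           # distance of each neuron to x
    bi, bj = np.unravel_index(d.argmin(), d.shape)    # best matching unit (BMU)
    ii, jj = np.meshgrid(np.arange(p1), np.arange(p2), indexing="ij")
    neigh = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / sigma ** 2)
    return weights + lr * neigh[..., None] * (x - weights)

# Toy usage: a 5x4 lattice of 3-dimensional prototypes
w = kohonen_step(np.random.rand(5, 4, 3), np.random.rand(3), lr=0.1, sigma=1.5)
```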

2.3 DeepSets

It has been proven [32] that any function sf(X) over a set X, satisfying the following two properties:

  1. Variable number of elements in input, i.e., each input is a set \(X=\{x_1, \ldots , x_{m}\}\) with \(x_i\) belonging to some set \(\mathfrak {X}\) (typically a vectorial space) and \(m > 0\);

  2. Permutation invariance;

can be decomposed in the form:

$$\begin{aligned} sf(X) = \rho \left( \sum _{x_i \in X} \phi (x_i)\right) , \end{aligned}$$
(5)

for some \(\rho (\cdot )\) and \(\phi (\cdot )\) functions, if \(\mathfrak {X}\) is countable. This is the general formulation of DeepSets [32], which constitute a valid option to implement an aggregation operator. First of all, they can natively take as input the sets of topologically enriched (by graph convolutions) representations of a graph's nodes. Moreover, being in principle universal approximators for a wide range of functions over countable sets, or over uncountable sets of fixed size, they are potentially very expressive from a functional point of view. Here we elaborate on this last capability by recalling some concepts from Zaheer et al. [32]. One of the main arguments of the universal approximation proof of DeepSets for the countable case, i.e., where the elements of the sets are countable (\(|\mathfrak {X}| \le n_0\)), relies on the fact that, given the space of input sets \(\mathcal {X}\subseteq 2^{\mathfrak {X}}\), any function over sets can be decomposed as \(sf(X)=\rho (e(X))\), where \(e: \mathcal {X} \rightarrow \mathbb {R}^n\), \(e(X)=\sum _{x_i \in X}\phi (x_i)\), which combines the elements \(x_i\in X\) non-linearly transformed by the \(\phi (\cdot )\) function, maps different sets to different points. Since we should be able to associate (via \(\rho (\cdot )\)) different outputs with different inputs, the \(\phi (\cdot )\) function should map its inputs (the elements of the sets) to an encoding of the natural numbers, and \(e(\cdot )\) should provide a unique representation for every \(X \in \mathcal {X}\). In the countable case, one way to achieve this property is to define the \(\phi (\cdot )\) function so that it maps each set element to a representation that is orthogonal to the representations of every other set element. For the uncountable case, the scenario becomes more complex, requiring \(\phi (\cdot )\) to be a homomorphism.
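A minimal PyTorch sketch of the decomposition in Eq. (5) is reported below: \(\phi (\cdot )\) is applied element-wise, the results are summed, and \(\rho (\cdot )\) maps the pooled vector to the output (the MLP sizes are arbitrary and the module names are ours).

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        # phi: element-wise encoder; rho: readout applied to the summed representation
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, X):
        # X: (m, in_dim), one row per set element; summing over the rows makes
        # the output invariant to permutations of the elements
        return self.rho(self.phi(X).sum(dim=0))

model = DeepSets(in_dim=4, hidden=32, out_dim=2)
X = torch.randn(5, 4)
assert torch.allclose(model(X), model(X[torch.randperm(5)]), atol=1e-5)
```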

3 SOM-based DeepSets projection

In this paper, we propose to implement the \(\phi (\cdot )\) function of DeepSets by exploiting a Self-Organizing Map (SOM). A SOM can be defined to map each input embedding into the one-hot activation map of the SOM lattice, where only the winning neuron has a value different from zero. If we consider an infinitely wide SOM, it is easy to see that every input embedding activates a different winning neuron, and that different inputs are mapped to linearly independent SOM activation maps. Thus, using a SOM to implement the \(\phi (\cdot )\) function is, from a conceptual point of view, a viable approach. Following this path, however, poses some problems from the point of view of learnability. In fact, encoding sets into orthogonal vectors hinders the possibility of exploiting similarities among examples. To solve this problem, we propose to smooth the SOM representations by exploiting relatively small lattices, and to output not only 1 for the winning neuron, but also a smaller, adjusted value for its neighbors.

Finally, in order to compensate for the unsupervised nature of SOMs, we propose to process the SOM output with a graph convolutional layer that, in addition to allowing for supervised learning, can better preserve the topological information of the graph with respect to a simple MLP. After this operation, the aggregation operator as prescribed by DeepSets can be applied. Even though the DeepSets theory shows that the sum aggregation is maximally expressive, inspired by DiffPool [31] we consider the concatenation of different statistics, namely the sum, the average, and the component-wise maximum. We found that this choice leads to slightly improved overall predictive performance, probably because it slightly eases the training of the network. Finally, we implement the DeepSets readout function \(\rho (\cdot )\) as an MLP.

The proposed SOM-based projection is only one component of a graph neural network. In the following, we describe the overall network architecture, and we provide more details on the proposed aggregation operator.

We start considering graph convolution layers to provide a representation for each node in the graphs, using the following general equation:

$$\begin{aligned} \mathbf{h}^{GC(1)}_v=f\left( \hbox{graphconv}(\mathbf{x}_{v}, \{ \mathbf{x}_{u} | u \in \mathcal {N}(v)\} )\right) , \end{aligned}$$
(6)

that can be implemented with any graph convolution layer described in Appendix A. We stack d graph convolution layers as follows:

$$\begin{aligned} \mathbf{h}^{GC(i)}_v=f\left( \hbox{graphconv} ( \mathbf{h}^{GC(i-1)}_v, \{\mathbf{h}^{GC(i-1)}_u |u \in \mathcal {N}(v)\} ) \right) , \end{aligned}$$
(7)

where \(1 \le i \le d\). Let us consider the node representation generated by a graph convolution layer \(\mathbf{h}^{GC(i)}_v\). We define our SOM-based aggregation operator as follows:

$$\begin{aligned} \mathbf{h}'^{S(i)}_v&=\hbox{som}(\mathbf{h}^{\mathrm{GC}(i)}_v), \end{aligned}$$
(8)
$$\begin{aligned} \mathbf{h}''^{S(i)}_v&=f\left( \hbox{graphconv}\left( \mathbf{h}'^{S(i)}_v,\left\{ \mathbf{h}'^{S(i)}_u | u \in \mathcal {N}(v)\right\} \right) \right) ,\end{aligned}$$
(9)
$$\begin{aligned} \mathbf{h}^{S(i)}&=\hbox{aggr}\left( \left\{ \mathbf{h}''^{S(i)}_v | v \in V_{G}\right\} \right) , \end{aligned}$$
(10)

where \(\hbox{som}(\cdot )\) is the function computing the SOM activations (detailed below).

In our architecture, we apply our proposed SOM-based projection operator to each graph convolution output, as shown in Fig. 1, obtaining d graph-level feature maps (one for each layer): \(\mathbf{h}^{S(1)}\), \(\mathbf{h}^{S(2)}, \ldots , \mathbf{h}^{S(d)}\). These feature maps are then concatenated, obtaining a single graph-level representation:

$$\begin{aligned} \mathbf{h}^{S}=[\mathbf{h}^{S(1)},\mathbf{h}^{S(2)}, \ldots , \mathbf{h}^{S(d)}]. \end{aligned}$$
(11)

We can apply the readout and the output layer (that together implement the \(\rho (\cdot )\) function in DeepSets) to the graph-level representation \(\mathbf{h}^S\), obtaining the output of our network:

$$\begin{aligned} \mathbf{o}=\hbox{LogSoftMax}(\hbox{readout}(\mathbf{h}^{S})). \end{aligned}$$
(12)
Fig. 1 A graphical layout of the proposed architecture, with an expanded view of the SOM-based aggregation block 1 (bottom right)

The readout function is composed of several dense feed-forward layers, where we consider the number of layers and the number of neurons per layer as hyper-parameters. Each one of these layers uses the ReLU activation function, and is defined as follows:

$$\begin{aligned} \mathbf{h}^{R(j)} = \hbox{ReLU}\left( \mathbf{W}^{R(j)} \mathbf{h}^{R(j-1)} + \mathbf{b}^{R(j)}\right) ,\quad j \in [1,\ldots ,k-1] \end{aligned}$$
(13)

where \(\mathbf{h}^{R(0)}=\mathbf{h}^{S}\). Finally, the output layer of the neural network for a c-class classification task is defined as follows:

$$\begin{aligned} \mathbf{o} = \hbox{LogSoftMax}(\mathbf{W}^{o} \mathbf{h}^{R(k-1)} + \mathbf{b}^{o}). \end{aligned}$$
(14)
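As a sketch, the readout of Eq. (13) followed by the output layer of Eq. (14) can be assembled as below, for \(k-1\) hidden layers of a common size (the helper name and the sizes are ours).

```python
import torch.nn as nn

def make_readout(in_dim, hidden, n_classes, k):
    """Readout of Eq. (13) (k - 1 ReLU layers) plus the output layer of Eq. (14)."""
    layers, dims = [], [in_dim] + [hidden] * (k - 1)
    for j in range(k - 1):
        layers += [nn.Linear(dims[j], dims[j + 1]), nn.ReLU()]
    layers += [nn.Linear(dims[-1], n_classes), nn.LogSoftmax(dim=-1)]
    return nn.Sequential(*layers)

# Shallow setting (k = 1): only the output layer is applied to h^S
shallow = make_readout(in_dim=96, hidden=64, n_classes=2, k=1)
```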

To reduce covariate shift during training and to attenuate overfitting, we applied batch normalization and dropout to the output of each graph convolutional layer.

3.1 SOM details

To define the SOM-based aggregation block, we adopted a SOM model that exploits a two-dimensional \(p' \times p''\) map. We recall that we have an aggregation operator, and thus a SOM, associated with each graph convolution. The neurons of the SOM at the k-th convolution layer, \(s^{(k)}_{i,j}\), are thus identified by two indices, \(i \in \{1,\ldots , p'\}\) and \(j \in \{1,\ldots , p''\}\). As similarity measure, we use the 2-norm of the difference between the input and the SOM synaptic weights \(\mathbf{s}^{(k)}_{i,j}\). Thus, the distance between a SOM neuron and the input embedding of node v at layer k is defined as

$$\begin{aligned} d_{v,i,j}^{(k)}=\left\| \mathbf{h}^{\mathrm{GC}(k)}_v- \mathbf{s}^{(k)}_{i,j}\right\| _2. \end{aligned}$$
(15)

This measure is used to compute the BMU for each forward pass of the SOM, identified by the tuple of its position:

$$\begin{aligned} \left( i^{*(k)}_v,j^{*(k)}_v\right) =\hbox{argmin}_{i,j} \left( d_{v,i,j}^{(k)}\right) . \end{aligned}$$
(16)

As the output of the SOM module, and in accordance with our discussion about the DeepSets approach, we propose to exploit the distances \(d_{v,i,j}^{(k)}\) to compute a similarity measure in the interval [0, 1] for each SOM neuron:

$$\begin{aligned} \varsigma ^{S(k)}_{v,i,j} = \alpha ^{\sigma , i^*, j^*}_{i,j} e^{-d_{v,i,j}^{(k)}+d_{v,i^*,j^{*}}^{(k)}}, \end{aligned}$$
(17)

where, for ease of notation, we omit the dependency of \(i^*\) and \(j^*\) on v and k. In this way, each BMU always outputs the value 1. The neighborhood function \(\alpha ^{\sigma , i^*, j^*}_{i,j}\) is defined as a Gaussian over the topological distance between a neuron and the BMU, as follows:

$$\begin{aligned} \alpha ^{\sigma , i^*, j^*}_{i,j} = e^{-({(i^*-i)^2+(j^*-j)^2}) / \sigma ^2}, \end{aligned}$$
(18)

where \(\sigma\) is a hyper-parameter. Combining Eqs. (17) and (18), we finally get

$$\begin{aligned} \varsigma _{v,i,j}^{S(k)}= e^{-d_{v,i,j}^{(k)}+d_{v,i^*,j^{*}}^{(k)}-({(i^*-i)^2+(j^*-j)^2}) / \sigma ^2}. \end{aligned}$$
(19)

We flatten the two-dimensional output of the SOM, obtaining a vectorial representation of the SOM activation map for each node, i.e., \(\mathbf{h}'^{S(k)}_v\), where each element of such a vector is defined as \(\{\mathbf{h}'^{S(k)}_{v}\}_{i\cdot p'+j}=\varsigma ^{S(k)}_{v,i,j}\).
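The following PyTorch sketch computes Eqs. (15)–(19) and the flattening described above for a single node embedding (tensor shapes and names are ours).

```python
import torch

def som_activation(h, S, sigma):
    """h: node embedding of shape (dim,); S: SOM weights of shape (p1, p2, dim).
    Returns the flattened activation map of Eq. (19), with value 1 at the BMU."""
    p1, p2 = S.shape[0], S.shape[1]
    d = torch.linalg.norm(h - S, dim=2)                 # Eq. (15), one value per neuron
    i_star, j_star = divmod(int(torch.argmin(d)), p2)   # Eq. (16), BMU coordinates
    ii = torch.arange(p1, dtype=h.dtype).view(-1, 1)
    jj = torch.arange(p2, dtype=h.dtype).view(1, -1)
    alpha = torch.exp(-((ii - i_star) ** 2 + (jj - j_star) ** 2) / sigma ** 2)  # Eq. (18)
    varsigma = alpha * torch.exp(-(d - d[i_star, j_star]))                      # Eq. (17)
    return varsigma.flatten()                            # flattened map h'^{S(k)}_v

# Toy usage: a 4x3 lattice over 8-dimensional node embeddings
out = som_activation(torch.randn(8), torch.randn(4, 3, 8), sigma=1.5)
```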

Finally, we implement the aggregator function \(\hbox{aggr}(\cdot )\) computing global statistics over the nodes in a graph as follows:

$$\begin{aligned} \mathbf{h}^{S(i)}&= \hbox{aggr}\left( \left\{ \mathbf{h}''^{S(i)}_v |\ v \in V_G\right\} \right) \end{aligned}$$
(20)
$$\begin{aligned}&=\left[ \hbox{avg}_{ v \in V_G}\left( \mathbf{h}''^{S(i)}_v\right) , \hbox{max}_{v \in V_G}\left( \mathbf{h}''^{S(i)}_v\right) , \hbox{sum}_{v \in V_G}\left( \mathbf{h}''^{S(i)}_v \right) \right] . \end{aligned}$$
(21)
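In code, Eqs. (20)–(21) reduce to a one-line concatenation of statistics over the matrix stacking the post-SOM node representations of a graph (a sketch with names of our choosing).

```python
import torch

def aggr(H):
    # H: (n_nodes, f) -- the rows are the h''^{S(i)}_v of a single graph
    return torch.cat([H.mean(dim=0), H.max(dim=0).values, H.sum(dim=0)])

h_graph = aggr(torch.randn(7, 12))   # graph-level vector of size 3 * 12
```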

3.2 Training procedure

To train a GNN that exploits the proposed SOM-based aggregation, we have to deal with the fact that the SOM model requires an unsupervised training algorithm, while we consider supervised learning problems. For this reason, we developed a four-step training procedure. The basic idea is, first of all, to learn initial node embeddings without using the SOM (pre-training); the aim of this step is to learn stable representations on which to train the SOM. In the second step, the SOM is trained with an unsupervised method. The representations developed by the SOM are then used for the supervised training of the rest of the network (post-SOM GC and readout). Training concludes with a fine-tuning step involving all network parameters, except for the SOM weights.

Let us define the set of parameters of the Graph Convolution Block in equation (7) as \(\theta ^{\mathrm{GCB}}\), the SOM parameters as \(\theta ^{\mathrm{SOM}}\), and all the other parameters (post-SOM GC and readout) as \(\theta ^{\mathrm{rest}}\).

The first step consists in training the \(\theta ^{\mathrm{GCB}}\) parameters of the Graph Convolution Block. Only for this first step, we add an ad-hoc readout layer, which we refer to as the pre-training readout, to perform supervised learning with backpropagation (see Fig. 1, top left). This allows this part of the model to be trained separately from the rest of the network. The pre-training readout layer is defined as follows:

$$\begin{aligned} \mathbf{h}^{S(i)}_{\mathrm{pre}}&= \hbox{aggr}\left( \left\{ \mathbf{h}^{\mathrm{GC}(i)}_v | v \in V_G\right\} \right) \nonumber \\&=\left[ \hbox{avg}_{ v \in V_G}\left( \mathbf{h}^{\mathrm{GC}(i)}_v\right) , \hbox{max}_{v \in V_G}\left( \mathbf{h}^{\mathrm{GC}(i)}_v\right) , \hbox{sum}_{v \in V_G}\left( \mathbf{h}^{\mathrm{GC}(i)}_v\right) \right] , \end{aligned}$$
(22)
$$\begin{aligned} \mathbf{h}^{S}_{\mathrm{pre}}&=\left[ \mathbf{h}^{S(1)}_{\mathrm{pre}},\mathbf{h}^{S(2)}_{\mathrm{pre}}, \ldots , \mathbf{h}^{S(d)}_{\mathrm{pre}}\right] , \end{aligned}$$
(23)
$$\begin{aligned} \mathbf{o}_{\mathrm{pre}}&= \hbox{LogSoftMax}\left( \mathbf{W}_{o_{\mathrm{pre}}} \mathbf{h}^{S}_{\mathrm{pre}} + \mathbf{b}_{o_{\mathrm{pre}}}\right) . \end{aligned}$$
(24)

The second step consists in training the \(\theta ^{\mathrm{SOM}}\) weights of the SOM layers of the SOM-based aggregation blocks. We adopted the well-known unsupervised training method proposed by Kohonen [16]. The aim of the third phase is to train the \(\theta ^{\mathrm{rest}}\) parameters. This phase again exploits supervised training, and the part of the model involved is described by Eqs. (9)–(12); during this phase, the SOM outputs do not change. The last training step is a fine-tuning phase, whose aim is to tune the model parameters \(\theta ^{\mathrm{GCB}}\) and \(\theta ^{\mathrm{rest}}\) while keeping \(\theta ^{\mathrm{SOM}}\) fixed. Further adaptation cycles could take place by retraining the SOMs, and so on; in this paper, however, we do not apply such further cycles. The pseudo-code that summarizes the training procedure is reported in Algorithm 1.

Algorithm 1 The four-step training procedure
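The following Python sketch summarizes the four steps at a high level; the module attributes (pretrain_net, gc_block, soms, post_som_gc, readout) and the helper methods are assumptions about how the model could be organized, not the authors' implementation.

```python
import torch

def supervised_epochs(model, loader, params, n_epochs):
    """Generic supervised loop shared by steps 1, 3 and 4 (NLL loss on LogSoftMax outputs)."""
    opt = torch.optim.Adam(params)
    loss_fn = torch.nn.NLLLoss()
    for _ in range(n_epochs):
        for batch in loader:
            opt.zero_grad()
            loss_fn(model(batch), batch.y).backward()
            opt.step()

def train_som_gcnn(model, loader, epochs):
    # Step 1: pre-train the Graph Convolution Block (theta_GCB) through the
    # ad-hoc pre-training readout of Eqs. (22)-(24).
    supervised_epochs(model.pretrain_net, loader,
                      model.pretrain_net.parameters(), epochs["pre"])
    # Step 2: unsupervised Kohonen training of the SOMs (theta_SOM) on the
    # frozen node embeddings produced by the Graph Convolution Block.
    with torch.no_grad():
        for batch in loader:
            for som, h in zip(model.soms, model.gc_block.node_embeddings(batch)):
                som.kohonen_update(h)
    # Step 3: supervised training of the post-SOM GC layers and the readout
    # (theta_rest), keeping theta_GCB and theta_SOM fixed.
    supervised_epochs(model, loader,
                      list(model.post_som_gc.parameters())
                      + list(model.readout.parameters()), epochs["rest"])
    # Step 4: fine-tuning of theta_GCB and theta_rest, with theta_SOM fixed.
    finetune = [p for name, p in model.named_parameters() if "som" not in name]
    supervised_epochs(model, loader, finetune, epochs["finetune"])
```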

4 Experimental results

In this section, we report and discuss the results obtained by our proposed model, which we refer to as SOM-GCNN. We start by introducing the adopted datasets, our model setup, and the hyper-parameter selection strategy. We then present and discuss our experimental results.

4.1 Datasets

The proposed model, and the other models compared with it, were empirically validated on commonly adopted graph classification benchmarks. Specifically, we used five datasets modeling bioinformatics problems: PTC [12], NCI1 [29], PROTEINS [3], D&D [6] and ENZYMES [3]. PTC and NCI1 involve chemical compounds represented by their molecular graph, where node labels encode the atom types and edges correspond to bonds. The prediction task for PTC concerns the carcinogenicity of chemical compounds for male rats. In NCI1, the graphs represent anti-cancer screens for cell lung cancer. The remaining datasets, PROTEINS, D&D and ENZYMES, involve graphs that represent proteins: amino acids are represented by nodes, and edges connect amino acids that are less than 6 Å apart. All prediction tasks are binary classification tasks, except for ENZYMES, which requires a multi-class classification into 6 classes. Relevant statistics about the datasets are reported in Table 1.

Table 1 Datasets statistics

4.2 Model selection and experimental setup

In order to test the effectiveness of the proposed SOM aggregation block, we applied it to a simple GNN model inspired by the FGCNN model [21], which obtained very good results on the considered datasets. Our model is reported in Fig. 1. In particular, the base model (the Graph Convolution Block in the figure) is composed of three Graph Convolution layers (therefore d is set to 3 in Eq. (11)). As in FGCNN, the number of neurons of each layer increases with depth. More precisely, \(\mathbf{h}^{\mathrm{GC}(1)} \in \mathbb {R}^l\), \(\mathbf{h}^{\mathrm{GC}(2)} \in \mathbb {R}^{2l}\), and \(\mathbf{h}^{\mathrm{GC}(3)} \in \mathbb {R}^{3l}\). As graph convolution operator we used GraphConv [20]. As activation function f for the various convolution layers we chose the ReLU.

Concerning the readout part of the model, we experimented with a shallow setting, where \(k = 1\), so that only the output layer defined in Eq. (14) is used after the aggregation \(\mathbf{h}^S\) of Eq. (11). We also tested a model that exploits a deeper version of the readout, composed of three fully connected feed-forward layers before the output layer (\(k = 3\)). In any case, the pre-training readout exploited a single output layer, i.e., \(k = 1\). The results reported in Table 4 were obtained by performing five runs of 10-fold cross-validation.

The goal of this work is to evaluate the benefit of using the SOM-based aggregation. For this reason, we focused our attention on the SOM hyper-parameters; in particular, we carefully validated the dimensions of the lattice and the learning rate used during the self-organizing phase. These two parameters were selected using a limited grid search, where the explored set of values changes based on the considered dataset (see Table 2). Due to the high time requirements of an extensive grid search, we decided to limit the number of values considered for each hyper-parameter, guided by several preliminary tests. As score, we used the average accuracy computed over the 10-fold cross-validation on the validation sets, and we used the same set of selected hyper-parameters for each fold. The epoch was selected for each fold independently, based on the accuracy on the validation set. Another parameter that influences the SOM performance is \(\sigma\), which modifies the impact of the neighborhood function (see Eq. (18)). For this value, we adopt a heuristic that makes it dependent on the maximum dimension of the SOM:

$$\begin{aligned} \sigma = \hbox{max}(p', p'') /2. \end{aligned}$$
(25)

During training, the trend of the neighborhood function is influenced not only by the value of \(\sigma\), but also by the learning rate. Indeed, at each training iteration the \(\sigma\) value used by the neighborhood function to update the weights is defined as follows:

$$\begin{aligned} \sigma _{\mathrm{training}}=\sigma \cdot lr \cdot (1- i_c/i_t), \end{aligned}$$
(26)

where lr is the learning rate, \(i_c\) is the current iteration, and \(i_t\) is the total number of training iterations.
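Under the reading that the neighborhood radius decays linearly to zero over the training iterations, Eqs. (25)–(26) amount to the following small helper (names are ours).

```python
def neighborhood_sigma(p1, p2, lr, i_c, i_t):
    """Base sigma from the lattice size (Eq. (25)) and its decayed value
    at training iteration i_c out of i_t (Eq. (26))."""
    sigma = max(p1, p2) / 2
    return sigma * lr * (1 - i_c / i_t)
```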

Table 2 Sets of hyper-parameters values used for model selection via grid search

An issue related to the proposed SOM-based aggregation is its high number of hyper-parameters, which makes a complete grid search over all of them time-consuming. For this reason, for the remaining parameters of the model we adopted a random search strategy; we therefore believe that all the reported results may be slightly improved by a more systematic search. The selected parameters for each dataset/configuration are reported in Table 3. For further details about the experimental setup, the publicly available code can be consulted (see footnote 1).

Table 3 Hyper-parameters selected via random search

4.3 Baselines

We compare SOM-GCNN with several GNN architectures that achieved state-of-the-art results on the different datasets. Table 4 reports the results of this comparison. Specifically, we consider PSCN [23], Funnel GCNN (FGCNN) [21], DGCNN [34], GIN [30], DIFFPOOL [31] and GraphSage [11]. More details about these architectures are provided in “Appendix B”. For FGCNN, we did not consider the additional loss term based on the Weisfeiler-Lehman graph kernel, to ensure a fair comparison.

The results reported in the GIN paper [30] cannot be compared with the other results reported in Table 4, because the authors state: “The hyper-parameters we tune for each dataset are [...] the number of epochs, i.e., a single epoch with the best cross-validation accuracy averaged over the ten folds was selected.” Similarly, for the results reported in Chen et al. [4] for the GCN and GFN models, the authors state: “We run the model for 100 epochs, and select the epoch in the same way as [30], i.e., a single epoch with the best cross-validation accuracy averaged over the ten folds is selected.” In both cases, the model selection strategy differs from the one we adopted, which makes the results not comparable. Moreover, in the GIN and DIFFPOOL papers [30, 31] the node descriptors are augmented with structural features (a one-hot representation of the node degree in both works, and the clustering coefficient in the latter). We decided to use a common setting for the chemical domain, where nodes are labeled with a one-hot encoding of their atom type only. The only exception is ENZYMES, where it is common to use 18 additional available features, and we adopted this setting. These issues, and the importance of the validation strategy, are highlighted and discussed in [8]. The same paper reports the results of a fair comparison among the considered baseline models, and we follow this setting as much as possible.

Table 4 Accuracies by SOM-GCNN and state-of-the-art models on the five used datasets

For the sake of comparison, we also report the results of DGCNN-DeepSets [21]. We point out that these results are not completely comparable with the ones obtained by our models, for two reasons: (i) the DeepSets was implemented with MLPs on a different base GNN architecture, i.e., a DGCNN model; (ii) the adopted validation strategy differs from the one we applied in this study.

4.4 Discussion of experimental results

The results reported in Table 4 show that the predictive performance of our proposed SOM-GCNN architecture is highly competitive with the other considered methods on all the considered datasets, with the exception of ENZYMES. Specifically, the proposed method shows the best performance on the PTC, NCI1, PROTEINS and D&D datasets, significantly outperforming state-of-the-art methods in many cases. NCI1 and D&D are the two datasets with the highest numbers of nodes/graphs. From this point of view, it is also interesting to notice that D&D is the only case where the deep version of the readout improves over the shallow one. Indeed, on the other datasets, the higher number of parameters of the deep readout and the limited size of the datasets favor the onset of overfitting; this is more visible in the fine-tuning phase. In this regard, it is important to point out that the selection of the readout type highly depends on the considered dataset. For the sake of analysis, we report the results of both readout types for all datasets. In some cases (such as D&D and PROTEINS) we consider the model with a deep readout, while for the other datasets it is clearly more convenient to use a shallow one. Note that the selection of the readout type is part of the validation process. Moreover, the proposed aggregation method is completely independent of the adopted readout architecture. The second best performing method on D&D is DGCNN in the validation setting of Navarin et al. [21]; however, in the more rigorous validation setting of Errica et al. [8] that we followed, the gap with respect to our proposed SOM-GCNN is larger. On PROTEINS, SOM-GCNN achieves slightly improved predictive performance compared to PSCN [23], but with a significantly lower variance. Moreover, its performance is close to that of DGCNN-DeepSets [21], for which we recall there is a difference in the validation setting that can favor DGCNN-DeepSets. The ENZYMES dataset involves a 6-class classification task, and in this case the proposed SOM-based aggregation method seems very effective in increasing the accuracy of the graph convolutional network: the results obtained after the SOM-based aggregation significantly improve over the ones achieved after the first training phase. Unfortunately, the accuracy of the base GNN model (Table 5) is significantly lower than the ones reached by the state-of-the-art models. We argue that this difference in accuracy is highly related to the readout part of the model. In fact, the model considered during the first training phase is similar to the FGCNN network reported in Table 4, except for the readout part, which in our case is shallow. The choice of using a simple, shallow readout layer for pre-training the first part of the model is due to the fact that the readout part (Eqs. (22)–(24)) is discarded after the pre-training phase. Moreover, the first training step aims to properly initialize the weights of the Graph Convolution Block, so as to provide meaningful inputs to the second training step, which involves the SOMs.

Table 5 Accuracy obtained after the various training steps

In Table 5, we report the accuracy reached in the different training phases (we merge the SOM training and the readout training, since both steps are necessary to obtain a fully trained model). In Table 4, we report for each dataset/configuration the best result with or without fine-tuning, selected according to the validation error. Indeed, the fine-tuning phase does not always improve the model classification performance: it turns out to be more useful when the size of the training set is large, and also in this case the issue is related to the already mentioned overfitting phenomenon. The results reported in Table 5 show that the application of the SOM-based aggregation blocks always allows improved results compared to the base model (pre-training). In particular, the comparison between the pre-training accuracy and the readout accuracy of the shallow model (where the exploited readouts are very similar) highlights the benefit of adopting the proposed aggregation technique. Table 6 reports the results obtained on the validation set by using different SOM dimensions. In three out of five datasets, the SOMs with the largest lattices obtain better results. One exception is the deep readout version of SOM-GCNN on NCI1, where the results of the (15, 10) SOM are very close to those of the (20, 15) lattice; moreover, in this case overfitting also influences the results. In our preliminary experiments, we noticed that using larger lattices drives the model to overfit quickly. The other exceptions are the PTC and ENZYMES datasets. These datasets are the ones with the smallest number of nodes/graphs (see Table 1), which reduces the advantage of using larger lattices. Larger SOMs also increase the number of parameters of the model and, considering the limited size of these two datasets, a larger lattice may accelerate the onset of overfitting. In the next section, we analyze in more detail the SOM's behavior in the proposed aggregation mechanism.

Table 6 Accuracy obtained on the validation sets by using SOM with different lattice sizes

In Table 5, we also report the results of a pre-trained graph convolutional network with four GC layers. These results turn out to be crucial to assess the benefit of using the SOM-based aggregation blocks. Indeed, considering that a SOM-based aggregation block includes a GC layer after the SOM, it is important to verify that the observed improvements with respect to the pre-trained model are not merely due to that additional GC layer. The comparison with a pre-trained model with four GC layers is intended to provide this information. The results show that, in all cases, the use of a further GC layer does not significantly improve the accuracy, which turns out to be similar to the results obtained with the three-layer graph convolutional block. We can therefore conclude that the SOM plays a crucial role in the proposed aggregation method.

5 SOM-based aggregation block analysis

In this section, we analyze how the use of the SOM layer modifies the representations learned by a GNN. In particular, we start by analyzing the benefits of using a trained SOM versus an untrained one. Subsequently, we study the global distribution of the representations generated by the SOM. We do this by comparing the normalized distribution of the distances between all pairs of node embeddings in the whole dataset, as computed by the SOM, against the ones obtained by the first set of GC layers and by the post-SOM GC layers. As a result of this analysis, we question whether the GC projection layers are actually useful; for this reason, we perform an ablation study showing that these layers are beneficial for the final performance. Finally, we explore how nodes belonging to positive/negative graphs are mapped onto the SOM lattice. This last analysis is quite interesting since, in principle, it allows the output of the model to be explained, which is an additional advantage of integrating SOM modules into the model.

5.1 Advantage in using SOM-based aggregation block

In order to evaluate the positive effect of using the SOM projection, we defined a model by randomly initializing the SOM synaptic weights \(\mathbf{s}^{(k)}_{i,j}\) and trained it without performing the second step, in which the SOM weights are trained with the unsupervised method. Thus, the SOMs of the obtained models have all their weights set to random values. The results of these experiments are reported in Table 7, and they can be directly compared with the ones reported in Table 5, since we used the same hyper-parameters. Also in this case, the results were obtained by performing five runs of 10-fold cross-validation. It is possible to notice that in all cases the obtained accuracy drops significantly when random SOM synaptic weights are used. Note that, in some cases, the results after the readout training are even lower than the ones computed in the pre-training phase. It is important to point out that also in this case the readout training phase is performed. These results suggest that the SOMs allow us to obtain a node embedding that makes the classification task somewhat easier. In the following, we analyze how the SOM influences the obtained node embeddings.

Table 7 Accuracy obtained using randomly initialized SOMs, without performing the unsupervised learning phase on them

5.2 Node embedding distances

In Fig. 2, we report the cumulative distribution of the (normalized) pairwise distances between all nodes of the training set graphs of the NCI1 dataset. We observed a similar behavior for the other datasets, and we thus omit the corresponding plots for brevity. The black line represents the cumulative distribution for the input node representations (node labels). From the figure, it is possible to notice that in the NCI1 dataset there are almost \(10^{4}\) nodes that share the same input representation.

Fig. 2 Cumulative distribution of the pairwise node distances of all nodes of the training set graphs of the NCI1 dataset. Note that on the y-axis each value is written in scientific notation and is multiplied by \(10^7\)

The blue line represents the cumulative distribution for the representations returned by the Graph Convolution Block of the model. In this specific case, the considered node representation is the concatenation of the outputs of the three graph convolutional layers:

$$\begin{aligned} \mathbf{h}^{\mathrm{GC}}_v=\left[ \mathbf{h}^{\mathrm{GC}(1)}_v, \mathbf{h}^{\mathrm{GC}(2)}_v, \mathbf{h}^{\mathrm{GC}(3)}_v\right] . \end{aligned}$$
(27)

The SOM in the i-th SOM-based aggregation block receives as input the corresponding node embeddings \(\mathbf{h}^{GC(i)}_v\) and computes as output \(\mathbf{h}'^{S(i)}_v\). The red curve represents the cumulative distribution of the distances between the concatenations of the outputs of the various SOM layers:

$$\begin{aligned} \mathbf{h}'^{S}_v=\left[ \mathbf{h}'^{S(1)}_v, \mathbf{h}'^{S(2)}_v,\mathbf{h}'^{S(3)}_v\right] . \end{aligned}$$
(28)

Finally, we analyze the representations obtained by the Graph Convolution Layer Projection. Also in this case, we considered the distances between the concatenations of the various layer outputs:

$$\begin{aligned} \mathbf{h}''^{S}_v=\left[ \mathbf{h}''^{S(1)}_v, \mathbf{h}''^{S(2)}_v,\mathbf{h}''^{S(3)}_v\right] . \end{aligned}$$
(29)

The yellow curve represents the corresponding cumulative distribution.
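A sketch of the analysis behind Fig. 2: the cumulative count of normalized pairwise distances between all node representations, computed here with a simple (memory-hungry) NumPy routine whose name and binning are ours.

```python
import numpy as np

def pairwise_distance_cdf(H, bins=100):
    """H: (n, f) matrix of node representations. Returns the bin edges and the
    cumulative count of the normalized pairwise distances (as plotted in Fig. 2)."""
    diff = H[:, None, :] - H[None, :, :]
    d = np.linalg.norm(diff, axis=-1)[np.triu_indices(len(H), k=1)]  # unique pairs
    d = d / d.max()                                                  # normalize to [0, 1]
    counts, edges = np.histogram(d, bins=bins, range=(0.0, 1.0))
    return edges[1:], np.cumsum(counts)
```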

The first observation is that the SOM tends to increase the pairwise distances between nodes. We think this is consistent with the SOM property of devoting several neurons of the lattice to densely populated zones of the input space. The second observation is that applying the Graph Convolution Layer Projection after the SOM layers reduces the distances among the representations; however, the resulting representations remain less close than the ones computed by the first Graph Convolution layers. Note that the representations \(\mathbf{h}''^{S(i)}_v\) encapsulate more information than \(\mathbf{h}^{\mathrm{GC}(i)}_v\) (and, consequently, than \(\mathbf{h}'^{S(i)}_v\)), because the additional Graph Convolution layer extends the node neighborhood by one hop.

As a final remark, we think it is fair to claim that the SOM seems to help the model develop more separable internal representations, which may explain the better observed performance.

5.3 Graph convolution layer projection

As reported in Fig. 1, the proposed SOM-based aggregation block is composed of three parts: the SOM Layer Projection, the Graph Convolution Layer Projection, and the Aggregation Layer. In the previous section, we showed that the Graph Convolution Layer Projection reduces the distances among the representations obtained by the SOM Layer Projection. In order to study whether the contraction performed by the Graph Convolution layer is useful, we trained the model after removing these specific components. The results of this ablation study are reported in Table 8, and they can be directly compared with the ones reported in Table 5, since we used the same hyper-parameters.

Table 8 Accuracy obtained by training the model with the Graph Convolution Layer Projection removed from the SOM-based aggregation blocks

The results show the benefit of having a Graph Convolution layer that aggregates the SOM Layer Projection outputs according to the graph topology. On all datasets, the accuracy decreases significantly compared to the results obtained with the full model (Table 5). Note that for the datasets with a higher average number of edges per graph (D&D, ENZYMES), the accuracy drop is even more pronounced.

5.4 SOM lattice representations

In Figs. 3, 4, 5 and 6, we report heatmaps whose entries are computed as follows. Let \(G_{+}\) and \(G_{-}\) be the sets of graphs belonging to class \(+\) and class −, respectively. For each layer level k and class \(c\in \{+,-\}\), we define the entry (i, j) of \(\hbox{heatmap}(c)^{(k)}\) as follows:

$$\begin{aligned} \hbox{heatmap}(c)_{i,j}^{(k)} = \frac{1}{|G_c|}\sum _{g\in G_{c}}\sum _{v\in V_g} \varsigma _{v,i,j}^{S(k)}. \end{aligned}$$
(30)

Notice that the inner summation computes a graph-based contribution of each neuron, since it aggregates the outputs of the neuron over all the nodes of a single graph. The outer summation then computes an average over the set of graphs belonging to the same class c.
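A minimal sketch of Eq. (30), assuming a function that returns, for each graph, the array of SOM activations of its nodes at layer k (the function and argument names are ours).

```python
import numpy as np

def class_heatmap(graphs, som_activations_of, p1, p2):
    """graphs: the graphs of one class; som_activations_of(g) is assumed to return
    an array of shape (num_nodes, p1, p2) with the varsigma values of Eq. (19)."""
    hm = np.zeros((p1, p2))
    for g in graphs:
        hm += som_activations_of(g).sum(axis=0)   # inner sum over the nodes of g
    return hm / len(graphs)                       # outer average over the class
```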

Each figure reports the heatmaps of each SOM layer (one per SOM-based aggregation block). We also report the difference between the two class heatmaps, in order to highlight how the SOM output conveys information related to the classification task. Our analysis is limited to the datasets that model a binary classification task.

In all datasets, the area where neurons are activated by the input widens with increasing depth, and in the first layer it tends to be close to the lattice borders. This behavior could be due to the limited size of the input vocabulary. In NCI1 (Fig. 3), the representations of the positive and negative classes are quite similar in the first layer, while a more prominent differentiation arises in the deeper layers. In PTC and D&D (Figs. 4 and 5, respectively), the visible distinction between the two classes lies in the magnitude of the SOM activations, while the areas of activation partially overlap. A very sharp difference between the representations of the two classes can be noticed in the PROTEINS dataset (Fig. 6): in this case, both the magnitude and the distribution of the active SOM neurons appear very different.

Fig. 3 Heatmaps computed according to Eq. (30) on a batch of the NCI1 test set. Each row presents, from left to right: the heatmap obtained by considering only nodes that belong to graphs with negative target, the heatmap obtained by considering only nodes that belong to graphs with positive target, and (on a different scale) the difference between the two heatmaps (positive minus negative). Each row reports the heatmaps relative to a single SOM-based aggregation block, from the first (top) to the third (bottom)

Fig. 4 Heatmaps computed according to Eq. (30) on a batch of the PTC test set. Each row presents, from left to right: the heatmap obtained by considering only nodes that belong to graphs with negative target, the heatmap obtained by considering only nodes that belong to graphs with positive target, and (on a different scale) the difference between the two heatmaps (positive minus negative). Each row reports the heatmaps relative to a single SOM-based aggregation block, from the first (top) to the third (bottom)

Fig. 5 Heatmaps computed according to Eq. (30) on a batch of the D&D test set. Each row presents, from left to right: the heatmap obtained by considering only nodes that belong to graphs with negative target, the heatmap obtained by considering only nodes that belong to graphs with positive target, and (on a different scale) the difference between the two heatmaps (positive minus negative). Each row reports the heatmaps relative to a single SOM-based aggregation block, from the first (top) to the third (bottom)

Fig. 6 Heatmaps computed according to Eq. (30) on a batch of the PROTEINS test set. Each row presents, from left to right: the heatmap obtained by considering only nodes that belong to graphs with negative target, the heatmap obtained by considering only nodes that belong to graphs with positive target, and (on a different scale) the difference between the two heatmaps (positive minus negative). Each row reports the heatmaps relative to a single SOM-based aggregation block, from the first (top) to the third (bottom)

In PTC, D&D and PROTEINS, the activations of the positive and negative classes differ both in the width of the activated area and in the magnitude of the activations. Indeed, all these datasets have a ratio between the positive and the negative class different from 1, and we argue that this difference is related to their imbalanced class distributions. Moreover, NCI1 is the only dataset that is almost perfectly balanced, and indeed its positive and negative heatmaps turn out to be more similar in terms of output magnitude.

6 Conclusion and future works

In this paper, we proposed a node aggregation scheme for graph convolutional neural networks inspired by DeepSets [32]. It exploits self-organizing maps followed by graph convolutions to transform the node embeddings, and then aggregates them, according to the DeepSets formulation, into a fixed-size graph-level representation. Due to the unsupervised nature of the SOM training algorithm, we developed an ad-hoc training procedure for supervised learning tasks to learn the parameters of the resulting graph neural network.

We empirically validated the proposed SOM-based aggregation method on five commonly adopted graph classification benchmarks modeling bioinformatics problems. Experimental results show that our proposal achieves improved predictive performance compared to competing methods on the majority of the considered datasets. We investigated how much the SOM component contributes to these results, showing that training the SOM is beneficial for the predictive performance, as is stacking a GC projection layer after the SOM. Moreover, we studied the distances between node representations after the first graph convolution block, comparing them with the distances after applying the SOM mapping and the subsequent Graph Convolution Layer Projection. The comparison shows the benefit of using the SOM-based aggregation block to increase the distances between node embeddings, thus potentially making the classification task easier.

Finally, thanks to the use of SOMs in the proposed model, we were also able to produce heatmaps representing each graph. By taking the average of such heatmaps over a set of graphs, we studied how SOMs represent graphs belonging to different classes.

A limit of the proposed approach is its relatively high number of hyper-parameters, which makes the application of a complete grid search time-consuming. In the future, we plan to study and develop a simpler version of the SOM-based aggregation that depends on a significantly lower number of hyper-parameters. We will extend the study to different types of lattice configurations, as well as to more advanced types of SOM models (e.g., the Generative Topographic Map (GTM) [2]). Finally, we will explore a fully supervised extension of our proposal exploiting supervised SOM training algorithms, e.g., the algorithm proposed by Hagenbuchner et al. [10].