1 Introduction

Graphs are natural data structures that appear in various domains such as bioinformatics (Sharan and Ideker 2006), cheminformatics (Trinajstic 2018) and social network analysis (Scott 2011), where nodes (vertices) represent objects and edges represent the relations between them. A popular approach to learning with graph-structured data is to make use of graph kernels. Essentially, a graph kernel is a measure of the similarity between two graphs and must satisfy two fundamental requirements of a valid kernel: it must be (1) symmetric and (2) positive semi-definite (PSD). In addition, a well-designed graph kernel should capture the semantics inherent in the graph structures (e.g. substructures at different levels) and must be efficiently computable (Vishwanathan et al. 2010).

A number of graph kernels have been proposed in the literature, such as random walk (Kashima 2003), shortest path (Borgwardt and Kriegel 2005) and Weisfeiler-Lehman (WL) subtree (Shervashidze and Borgwardt 2009) kernels, to name a few. Most of them are based on the \({\mathcal {R}}\)-Convolution framework (Haussler 1999), which decomposes two graphs into substructures and adds up the similarities between their substructures to compute kernel values. Different graph kernels correspond to different ways of decomposition (types of substructures); for instance, the substructures can be random walks (Kashima 2003), shortest paths (Borgwardt and Kriegel 2005) or subtree patterns (Vishwanathan et al. 2010). Among these, WL subtree kernels have been shown to achieve excellent predictive performance while being efficiently computable. The key point is that they simply employ a WL-based color refinement scheme to embed each node of a given graph into a vector of WL labels, which correspond to subtree patterns of the graph. The kernel between two graphs is then defined as the sum of all pairwise similarities between any two node embeddings of the two graphs.

Following a different approach, the WL-based optimal assignment (WL-OA) kernel (Kriege et al. 2016) assigns each node embedding of one graph to a node embedding of the other such that the total similarity between assigned node embeddings is maximized; this is known as the optimal assignment problem in combinatorial mathematics (Munkres 1957). In a similar vein, the Wasserstein WL kernel (WWL, Togninalli et al. 2019) uses optimal transport (OT), also known as the Wasserstein distance (Villani 2008), to measure the distance between two graphs based on their WL node embeddings. The distance is then converted into a similarity matrix through the Laplacian kernel. Both of these similarity matrices are shown to be valid kernels owing to the hierarchy property of WL labels (see Kriege et al. 2016 and Togninalli et al. 2019 for more details).

One of the main drawbacks of these kernels is that they rely on predefined feature extraction and do not learn the importance of substructures for the problem at hand. This decouples the data representation from the learning process: all substructures are given the same weight. However, for problems such as molecule classification, it is known that only subparts of the molecules are responsible for their properties. We therefore wish to assign weights to substructures in order to obtain higher classification performance and better model interpretation. Based on this motivation, we propose a model to learn the weights of subtree patterns (extracted by the WL labeling scheme). Our work extends WWL kernels (Togninalli et al. 2019) by formulating an OT-based distance as a parametric function of subtree pattern weights before converting it into a kernel. We also propose an efficient stochastic learning algorithm to estimate the weights and derive a bound on the generalization gap of the algorithm. Finally, through experiments on synthetic and four real-world data sets, we show that learning important subtree patterns with the proposed method leads to more accurate predictions and extracts important patterns that enhance the classification results.

The remainder of the paper is organized as follows: in Sect. 2, we review graph kernels based on the WL labeling scheme, including the WL subtree, WL-OA and WWL kernels. In Sect. 3, we parameterize the Wasserstein distance between two graphs under the WL labeling scheme as a function of subtree pattern weights and present a stochastic algorithm for learning the parameters of this function. In Sect. 4, we derive a generalization bound for the learning algorithm. In Sect. 5, experimental results on synthetic and real-world data sets are provided. Finally, we conclude by summarizing this work and discussing possible extensions in Sect. 6.

2 Related work

In this paper, we consider the binary classification problem for graph-structured data: given a collection of labeled graphs \((g_{i}, y_{i}), i=1,...,n\) (where n is the number of examples) drawn from an unknown joint distribution \({\mathcal {P}}\) over \({\mathcal {G}}\times \{-1,1\}\), where \({\mathcal {G}}\) is a space of graphs, we wish to learn a classifier \(h:{\mathcal {G}}\rightarrow \{-1,1\}\) based on a similarity function \(K:{\mathcal {G}}\times {\mathcal {G}}\rightarrow [-1,1]\). If K is symmetric and positive semi-definite (PSD), it is called a valid kernel.

There are a number of proposed kernels on graphs, see e.g. Kashima (2003), Borgwardt and Kriegel (2005), Shervashidze and Borgwardt (2009). In general, they are defined based on the \({\mathcal {R}}\)-Convolution framework (Haussler 1999): each graph \(g\in {\mathcal {G}}\) is decomposed into substructures, and a kernel value \(K(g,g^\prime )\) is defined as a sum of pairwise similarities between their substructures. In fact, many graph kernels can be considered as instances of the \({\mathcal {R}}\)-Convolution framework under different decompositions into substructures; the substructures can be random walks (Kashima 2003), shortest paths (Borgwardt and Kriegel 2005) or subtree patterns (Shervashidze and Borgwardt 2009). Among these, the Weisfeiler-Lehman (WL) subtree kernel (Shervashidze and Borgwardt 2009) and its variants have been shown to achieve great performance on graph classification tasks. In this work, we focus on WL-based kernels and review them in the following subsections.

2.1 Weisfeiler-Lehman (WL) scheme for node embeddings

Weisfeiler-Lehman (WL) subtree kernels (Shervashidze and Borgwardt 2009) are based on an iterative color refinement (also known as the WL labeling scheme) and have been shown to achieve great performance on graph classification tasks. For each node of a given graph, the WL labeling scheme creates a sequence of ordered strings by aggregating the labels of the node and its neighbors; these strings are then hashed or indexed to produce compressed, updated node labels or new indices. As the number of iterations increases, the obtained labels represent increasingly broader neighborhoods of each node. More specifically, for a graph \(G=(V,E)\) with initial labels \(\ell _{0}(v)\) for \(v\in V\) and H the number of WL iterations, we can define a sequence of refined labels \((\ell _{0},\ell _{1},...,\ell _{H})\), where \(\ell _{h+1}\) is obtained from \(\ell _{h}\) by the following procedure: \(\ell _{h+1}(v)=\mathrm {hash}(\ell _{h}(v), {\mathcal {N}}_{h}(v))\), where \({\mathcal {N}}_{h}(v)\) denotes the lexicographically sorted sequence of labels of v’s neighbors at iteration h and the \(\mathrm {hash}\) function creates an updated compressed node label for v. We use perfect hashing for the \(\text {hash}\) function, as in Shervashidze and Borgwardt (2009), ensuring that two nodes at iteration \(h+1\) have the same label if and only if their labels and those of their neighbors at iteration h are the same. Throughout the rest of the paper, we denote the set of WL labels at iteration h by \(\Sigma ^{h}\), for \(h=1,...,H\), and the set of all WL labels by \(\Sigma =\cup _{h=1}^{H}\Sigma ^{h}\). The WL labeling scheme is illustrated in Fig. 1a.
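For illustration, a minimal Python sketch of the refinement step is given below. The function and data layout (adjacency lists, label dictionaries, a shared hash table) are our own illustrative assumptions, not the authors' implementation.

```python
def wl_refine(adjacency, labels, H, table=None):
    """adjacency: {node: [neighbors]}, labels: {node: initial label}.
    Returns [labels_1, ..., labels_H]; `table` may be shared across graphs so
    that identical signatures receive identical compressed labels everywhere."""
    table = {} if table is None else table
    refined, current = [], dict(labels)
    for _ in range(H):
        new_labels = {}
        for v, neighbors in adjacency.items():
            # aggregate the node's label with the sorted labels of its neighbors
            signature = (current[v], tuple(sorted(current[u] for u in neighbors)))
            # "perfect hashing": each distinct signature gets a fresh integer label
            new_labels[v] = table.setdefault(signature, len(table))
        current = new_labels
        refined.append(current)
    return refined
```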

Based on WL labeling scheme, a WL node embedding scheme was proposed to generate node embeddings from node labels of the graphs (Shervashidze and Borgwardt 2009).

Definition 1

(WL based node embedding scheme, Shervashidze and Borgwardt 2009). Let \(G=(V,E)\) and let H be the number of WL iterations. For every \(h\in \{1,...,H\}\), we define the node embedding \(x^{h}(v)\) of a node \(v\in V\) and graph embedding \(\mathbf{X }^{h}(G)\) of G at iteration h as follows:

$$\begin{aligned} x^{h}(v) = \ell _{h}(v) \text {, } \mathbf{X }^{h}(G)=[x^{h}(v_{1}),...,x^{h}(v_{n_{V}})]^{T} \end{aligned}$$
(1)

Then the graph embedding of G can be defined as:

$$\begin{aligned} f^{H}(G): {\mathcal {G}}&\mapsto \Sigma ^{n_{V}\times H}\\ G&\mapsto \left[ \mathbf{X }^{1}(G)^{T},...,\mathbf{X }^{H}(G)^{T} \right] ^{T} \end{aligned}$$

where \(n_{V}\) is the number of nodes in G. With the WL node embedding scheme above, we are ready to introduce some notations which will be used throughout the rest of paper.

Notations Let \(D_{h}^{\text {Ham}}(f^{h}(G), f^{h}(G^\prime ))\) be the Hamming distance matrix where each entry is the normalized Hamming distance between the corresponding node embeddings of G and \(G^\prime\) at iteration h, defined as:

$$\begin{aligned} d_{h}^{\text {Ham}}(u,v) = \frac{1}{h}\sum _{i=1}^{h}d_{i}^{\text {disc}}(u,v) \text {, } d_{i}^{\text {disc}}(u,v)= {\left\{ \begin{array}{ll} 0 &{} \text {if } u_{i}=v_{i} \\ 1 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

where u and v denote two node embeddings, \(d_{h}^{\text {Ham}}(u,v)\) denotes the normalized Hamming distance between u and v at iteration h, and \(d_{i}^{\text {disc}}(u,v)\) denotes the discrete distance between u and v at iteration i. Similarly, let \(D^{\text {Disc}}_{h}(\mathbf{X }^{h}(G), \mathbf{X }^{h}(G^\prime ))\) be the discrete distance matrix whose entries are the discrete distances between the corresponding node embeddings of G and \(G^\prime\) at iteration h. It is easy to see that \([D^{\text {Ham}}_{h}]_{ij}\in [0,1]\) and \([D^{\text {Disc}}_{h}]_{ij}\in \{0,1\}\). We also define a base kernel, which corresponds to the average number of features shared by two node embeddings, as:

$$\begin{aligned} k_{h}(u,v) = \frac{1}{h}\sum _{i=1}^{h} \mathbb {1}(u_{i}=v_{i}) \end{aligned}$$
(3)

It is easy to see that \(k_{h}(u,v)=1- d_{h}^{\text {Ham}}(u,v)\). Thanks to the WL node embedding scheme, a graph can be represented as a point cloud or a set of node embeddings, and measuring the similarity (or dissimilarity) between two graphs boils down to measuring the similarity (or dissimilarity) between their two sets of embeddings. Each node embedding captures information about the neighborhood of the corresponding node (or a rooted subtree pattern). In the following subsections, we review different WL-based graph kernels derived from different ways of comparing their sets of WL node embeddings.
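For concreteness, the normalized Hamming distance (2) and the base kernel (3) can be computed as follows; this is a minimal sketch in which u and v are assumed to be arrays of WL labels of equal length.

```python
import numpy as np

def hamming_distance(u, v):
    u, v = np.asarray(u), np.asarray(v)
    return np.mean(u != v)               # (1/h) * sum of discrete distances

def base_kernel(u, v):
    return 1.0 - hamming_distance(u, v)  # k_h(u, v) = 1 - d_h^Ham(u, v)
```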

2.2 WL subtree kernels

WL subtree kernels (Shervashidze and Borgwardt 2009) simply employ the aforementioned WL labeling scheme to extract subtree patterns, which represent the neighborhood of each node in the graph up to a given distance (or number of hops H). Essentially, the WL subtree kernel counts the number of common WL labels. In the context of the WL node embedding scheme, it can be computed by summing all pairwise similarities between node embeddings of two graphs. More formally, for two graphs G and \(G^\prime\) with sets of node embeddings \(f^{H}(G)\) and \(f^{H}(G^\prime )\), respectively, the WL subtree kernel value between them can be defined as:

$$\begin{aligned} \mathbf{K }^{\text {WL}}(G,G^\prime ) = \sum _{\mathbf{x }\in f^{H}(G)}\sum _{\mathbf{y }\in f^{H}(G^\prime )}k_{H}(\mathbf{x },\mathbf{y }) \end{aligned}$$
(4)

Since the base kernel \(k_{H}(\mathbf{x },\mathbf{y })\) (defined in Eq. (3)) equals the fraction of WL labels (subtree patterns) shared by two node embeddings, the kernel \(\mathbf{K }^{\text {WL}}(G,G^\prime )\) is proportional to the total number of WL subtree patterns shared by the two graphs.
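As an illustration, Eq. (4) can be evaluated directly from the two embedding matrices; the sketch below assumes integer WL labels arranged as rows of arrays X and Y (one row per node, one column per iteration), which is our own data layout rather than the authors' code.

```python
import numpy as np

def wl_subtree_kernel(X, Y):
    """X: (m, H) node embeddings of G, Y: (n, H) node embeddings of G'."""
    X, Y = np.asarray(X), np.asarray(Y)
    # pairwise fraction of shared WL labels, i.e. k_H(x, y) for every pair
    K = (X[:, None, :] == Y[None, :, :]).mean(axis=2)
    return K.sum()  # Eq. (4): sum over all pairs of node embeddings
```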

2.3 WL-based optimal assignment kernels

Optimal assignments are natural measures of similarity between two sets of points. In particular, for two sets of points, the goal is to assign one point of one set to another point of the other set (one-to-one correspondence) such that the sum of similarities between assigned points is maximized. Finding such an optimal alignment or bijection is also known as the well-studied assignment problem in combinatorial optimization (Munkres 1957). However, a challenge is how to design a valid kernel based on optimal assignments.

Kriege et al. (2016) introduced a restricted class of kernels, called strong kernels, that guarantees the construction of valid optimal-assignment-based kernels. An important result is that strong kernels give rise to hierarchies defined on their domain. Based on this, the authors proposed WL-based optimal assignment (WL-OA) kernels built on the WL node embedding scheme. WL-OA kernels employ the base kernel defined in Eq. (3), which satisfies the requirement of being a strong kernel because the sequence of refined WL labels \((\ell _{0},\ell _{1},...,\ell _{H})\) gives rise to a family of nested subsets, which can be represented by a hierarchy. Consequently, WL-OA kernels are valid. Formally, given two graphs G and \(G^\prime\) with sets of node embeddings \(f^{H}(G)\) and \(f^{H}(G^\prime )\), respectively, the WL-OA kernel value between them is defined as:

$$\begin{aligned} \mathbf{K }^{\text {OA}}(G,G^\prime ) = \max _{B\in {\mathcal {B}}(f^{H}(G),f^{H}(G^\prime ))}\sum _{(x,y)\in B}k_{H}(\mathbf{x },\mathbf{y }) \end{aligned}$$
(5)

where \({\mathcal {B}}(f^{H}(G),f^{H}(G^\prime ))\) is the set of all bijections between the two sets \(f^{H}(G)\) and \(f^{H}(G^\prime )\). To apply this kernel to graphs with different numbers of nodes, we can pad the smaller set, say \(f^{H}(G^\prime )\), with new node embeddings \(\mathbf{z }\) satisfying \(k(\mathbf{x },\mathbf{z })=0\) for all \(\mathbf{x }\in f^{H}(G)\) without changing the result. It is worth noting that the WL-OA kernel takes only the similarities of aligned node embeddings into account, whereas the WL subtree kernel considers all pairwise similarities.
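For illustration, the assignment in Eq. (5) can be computed with a Hungarian-type solver on the pairwise base-kernel matrix, with zero-padding for graphs of different sizes as described above. This is a didactic sketch only, not the hierarchy-based algorithm used by Kriege et al. (2016).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wl_oa_kernel(X, Y):
    """X: (m, H), Y: (n, H) arrays of WL node embeddings."""
    X, Y = np.asarray(X), np.asarray(Y)
    K = (X[:, None, :] == Y[None, :, :]).mean(axis=2)  # k_H(x, y) for all pairs
    size = max(K.shape)
    padded = np.zeros((size, size))
    padded[:K.shape[0], :K.shape[1]] = K                # pad smaller side with zeros
    rows, cols = linear_sum_assignment(padded, maximize=True)
    return padded[rows, cols].sum()                     # optimal assignment value, Eq. (5)
```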

Fig. 1

a Illustration of Weisfeiler-Lehman (WL) node embedding scheme with two iterations (\(H=2\)). b Illustration of different WL based graph kernels: WL Subtree kernel is computed by considering all pairwise similarities between node embeddings of two graphs; WL optimal assignment kernel takes the similarities of only aligned node embeddings into account; For Wasserstein WL kernel, one node embedding of one graph can be coupled with multiple node embeddings of the other graph for computing the kernel

2.4 Wasserstein WL kernels

Optimal transport (OT), also known as the Wasserstein distance (Villani 2008), has attracted much attention in the machine learning community as a powerful tool for comparing two probability distributions. The naive computation of this distance between two discrete measures, e.g. point clouds, involves solving a transportation problem. Formally, let \(X=\{\mathbf{x }_{1},...,\mathbf{x }_{m}\}\) and \(Y=\{\mathbf{y }_{1},...,\mathbf{y }_{n}\}\) be two sets of points, where m and n denote the sizes of X and Y, respectively, and let \(\mathbf{p }\in {\mathbb {R}}^{m}\) and \(\mathbf{q }\in {\mathbb {R}}^{n}\) be two discrete probability distributions over X and Y, respectively. We use \(d_{ij}\) to denote the distance between \(\mathbf{x }_{i}\) and \(\mathbf{y }_{j}\), e.g. the squared Euclidean distance \(d_{ij}=\left\| \mathbf{x }_{i}-\mathbf{y }_{j}\right\| ^{2}_{2}\). The Wasserstein distance is formulated as a linear program over the transportation matrix (or joint probability) \(\mathbf{P }\in {\mathbb {R}}^{m\times n}\):

$$\begin{aligned} \begin{aligned} {\mathcal {W}}_{1}(X,Y, d) =&\min _{\mathbf{P }_{ij}}\sum _{i=1}^{m}\sum _{j=1}^{n}\mathbf{P }_{ij}d_{ij}\\ \text {subject to }&\sum _{i=1}^{m}\mathbf{P }_{ij}=\mathbf{q }_{j}\text {, } \forall j\in [1,n] \\&\sum _{j=1}^{n}\mathbf{P }_{ij}=\mathbf{p }_{i}\text {, } \forall i\in [1,m] \\ \end{aligned} \end{aligned}$$
(6)
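For illustration, problem (6) can be solved directly as a linear program; the sketch below does so with scipy.optimize.linprog and is intended only as a didactic reference (dedicated OT solvers are far faster in practice).

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(D, p, q):
    """D: (m, n) ground-distance matrix, p: (m,), q: (n,) probability vectors."""
    m, n = D.shape
    # row-sum constraints: sum_j P_ij = p_i  (P flattened row-major)
    A_rows = np.kron(np.eye(m), np.ones((1, n)))
    # column-sum constraints: sum_i P_ij = q_j
    A_cols = np.kron(np.ones((1, m)), np.eye(n))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([p, q])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun  # minimal transport cost, i.e. W_1(X, Y, d)
```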

Using the WL node embedding scheme to generate node embeddings for graphs, Togninalli et al. (2019) evaluated the pairwise Wasserstein distance between graphs with the normalised Hamming distance (2) as the ground distance. The Wasserstein WL (WWL) kernel is then defined as an instance of the Laplacian kernel, see Eq. (7).

$$\begin{aligned} \begin{aligned} \mathbf{D }^{\text {WWL}}(G,G^\prime )&={\mathcal {W}}_{1}(f^{H}(G),f^{H}(G^\prime ), d^{\text {Ham}}_{H})\\ \mathbf{K }^{\text {WWL}}(G,G^\prime )&=e^{-\gamma \mathbf{D }^{\text {WWL}}(G,G^\prime )} \end{aligned} \end{aligned}$$
(7)

where \(\gamma\) is a hyperparameter.

In general, it is not always possible to derive a valid kernel from the Wasserstein distance. However, thanks to the special property of WL labels and the normalized Hamming distance (2), \(\mathbf{D }^{\text {WWL}}\) was shown to be conditionally negative definite (CND), which implies the validity of \(\mathbf{K }^{\text {WWL}}\), by proving the following lemma (see Togninalli et al. 2019 for its proof):

Lemma 1

If a transportation matrix \(P^{H}\) is an optimal solution of (6) with the ground distance \(d_{H}^{\text {Ham}}\) (2) between node embeddings at iteration H, then the following two claims hold:

  1. \(P^{H}\) is also an optimal solution of (6) with the discrete distance \(d_{H}^{\text {Disc}}\) between the labels at iteration H.

  2. \(P^{H}\) is also an optimal solution of (6) with the normalised Hamming distance \(d_{H-1}^{\text {Ham}}\) between node embeddings at iteration \(H-1\).

Let \(P^{*}\) be the optimal solution of (6) for \(D^{\text {Ham}}_{H}\). By the above lemma, it is also an optimal solution for \(D^{\text {Disc}}_{h}\), \(h=1,...,H\). The Wasserstein distance between two graphs G and \(G^\prime\) in (7) can therefore be simplified as follows:

$$\begin{aligned} \mathbf{D }^{WWL}(G,G^\prime ) = \frac{1}{H}\sum _{h=1}^{H}{\mathcal {W}}_{1}(\mathbf{X }^{h}(G),\mathbf{X }^{h}(G^\prime ), d^{\text {Disc}}_{h}) \end{aligned}$$
(8)

Equation (8) is a sum of OT distances whose ground metrics are discrete distances and hence CND. Therefore, the sum is also CND, which implies the validity of the similarity matrix \(\mathbf{K }^{\text {WWL}}\).

3 Incorporating subtree pattern importance into WL based graph kernels

One of the main limitations of kernels is the decoupling of data representation and learning: the kernel must be predefined prior to learning, which limits predictive performance. Furthermore, in prediction tasks on molecular data, the output might be determined by the presence of a few important substructures, while these kernels weight all substructures equally. Motivated by this drawback, in this paper we address the problem of incorporating subtree pattern weights into the WWL kernel (Togninalli et al. 2019). To this end, we derive new kernels from a parametric form of the Wasserstein distance (8) that takes subtree pattern weights into account, and learn these weights from data optimally for the task.

3.1 Parametric form of Wasserstein distance with subtree pattern weights

To derive a parametric form of the distance function (8), we rely on the following simple observation:

Lemma 2

Let S be a set of elements and let \(X=\{\mathbf{x }_{1},...,\mathbf{x }_{m}\}\) and \(Y = \{\mathbf{y }_{1},...,\mathbf{y }_{n}\}\) (\(X,Y \subseteq S\)) be two multisets over S with m and n elements, respectively. The Wasserstein distance between them with the discrete distance as the ground metric is given by:

$$\begin{aligned} {\mathcal {W}}_{1}(X,Y) = 1 - \sum _{v\in S} \min (\mu _{X}(v), \mu _{Y}(v)) \end{aligned}$$
(9)

where \(\mu _{X}(v)\) denotes the mass density of element \(v\in S\) in the multiset X (i.e., its normalized count).
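For concreteness, Eq. (9) can be computed directly from label counts; the sketch below treats X and Y as lists of hashable labels and is purely illustrative.

```python
from collections import Counter

def wasserstein_discrete(X, Y):
    """Wasserstein distance between multisets X, Y under the discrete ground metric."""
    mu_x, mu_y = Counter(X), Counter(Y)
    # histogram intersection of the normalized counts, Eq. (9)
    overlap = sum(min(mu_x[v] / len(X), mu_y[v] / len(Y))
                  for v in mu_x.keys() & mu_y.keys())
    return 1.0 - overlap
```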

Applying this lemma to Eq. (8), we have:

$$\begin{aligned} \mathbf{D }^{\text {WWL}}(G, G^\prime )=1- \frac{1}{H}\sum _{h=1}^{H}\sum _{v\in \Sigma ^{h}}\min (\mu _{\mathbf{X }_{h}(G)}(v), \mu _{\mathbf{X }_{h}(G^\prime )}(v)) \end{aligned}$$
(10)

Our idea is to give each substructure or WL label v a nonnegative weight \(w_{v}\in {\mathbb {R}}_{\ge 0}\) for its importance to the problem, so the parametric form of Eq. (10) is defined as follows:

$$\begin{aligned} b- \frac{1}{H}\sum _{h=1}^{H}\langle \mathbf{w }_{h},\mathbf{z }_{h}\left( G,G^\prime \right) \rangle \end{aligned}$$
(11)

where \(\mathbf{w }_{h}\), \(\mathbf{z }_{h}\left( G,G^\prime \right) \in {\mathbb {R}}^{|\Sigma ^{h}|}\) are the vectors with entries \(w_{v}\) and \(\min (\mu _{\mathbf{X }_{h}(G)}(v), \mu _{\mathbf{X }_{h}(G^\prime )}(v))\), respectively, for \(v\in \Sigma ^{h}\); b is a constant ensuring that the value of the parametric function is nonnegative. In vector form, this can be expressed as:

$$\begin{aligned} d_{\mathbf{W }}(G,G^\prime )=b- \langle \mathbf{W }, \mathbf{Z }\left( G,G^\prime \right) \rangle \end{aligned}$$
(12)

where \({\left\{ \begin{array}{ll} \mathbf{Z }\left( G,G^\prime \right) =\frac{1}{H}\left[ \mathbf{z }_{1}\left( G,G^\prime \right) ^{T},...,\mathbf{z }_{H}\left( G,G^\prime \right) ^{T} \right] ^{T} \\ \mathbf{W }=\left[ \mathbf{w }_{1}^{T},...,\mathbf{w }_{H}^{T} \right] ^{T} \end{array}\right. }\)

The parametric form (12) is a linear function of the parameter vector \(\mathbf{W }\in {\mathbb {R}}^{d}\) (\(d=|\Sigma ^{1}|+...+|\Sigma ^{H}|\)), and \(\mathbf{Z }\left( G,G^\prime \right)\) can be regarded as a feature vector of the pair of graphs G and \(G^\prime\). Once the parameters are estimated, we can derive a similarity matrix through the Laplacian kernel as in Eq. (7). More importantly, as \(\mathbf{W }\) is nonnegative, \(d_{\mathbf{W }}\) is a CND function, and thus the derived similarity matrix is a valid kernel.
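To make the construction concrete, the sketch below assembles the feature vector \(\mathbf{Z }\left( G,G^\prime \right)\) from per-iteration label multisets and evaluates \(d_{\mathbf{W }}\). The data layout (per-iteration label lists and vocabularies \(\Sigma ^{h}\)) is our own assumption, e.g. obtainable from the WL refinement sketch in Sect. 2.1.

```python
import numpy as np
from collections import Counter

def feature_vector(labels_G, labels_Gp, vocabulary):
    """labels_G[h]: list of WL labels of G at iteration h+1;
    vocabulary[h]: ordered list of all labels in Sigma^{h+1} (assumed disjoint per h)."""
    H = len(vocabulary)
    blocks = []
    for h in range(H):
        mu_g, mu_gp = Counter(labels_G[h]), Counter(labels_Gp[h])
        n_g, n_gp = len(labels_G[h]), len(labels_Gp[h])
        # per-label histogram intersection min(mu_Xh(G)(v), mu_Xh(G')(v))
        z_h = np.array([min(mu_g[v] / n_g, mu_gp[v] / n_gp) for v in vocabulary[h]])
        blocks.append(z_h / H)
    return np.concatenate(blocks)

def parametric_distance(W, z, b):
    return b - np.dot(W, z)  # Eq. (12); nonnegative W keeps d_W conditionally negative definite
```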

3.2 Formulation of learning subtree pattern weights \(\mathbf{W }\)

We aim to learn the parameters \(\mathbf{W }\) in Eq. (12) using notions from metric learning (Kulis 2012): two input graphs with the same label are encouraged to be close, while two graphs with different labels are pushed apart. In other words, within-class distances should be small, while between-class distances should be large. For this purpose, for a graph pair g and \(g^\prime\), we can use the following two hinge loss functions: \(\max (0, \alpha _{1}-d_{\mathbf{W }}(g, g^\prime ))\) if g and \(g^\prime\) have different labels and \(\max (0, d_{\mathbf{W }}(g, g^\prime )-\alpha _{2})\) otherwise, where \(\alpha _{1}\) and \(\alpha _{2}\) are constants (\(\alpha _{1}\ge \alpha _{2}\)). The former yields a penalty if g and \(g^\prime\) with different labels are closer than \(\alpha _{1}\), while the latter yields a penalty when g and \(g^\prime\) with the same label are more distant than \(\alpha _{2}\). Instead of using these functions directly in the optimization problem, we use their smooth versions \(V_{1}\) and \(V_{2}\) (see Fig. 2), which offer useful properties for deriving a generalization bound in Sect. 4. The derivation of these functions is based on the connection between the strong convexity of a function and the Lipschitz continuity of the gradient of its Fenchel dual (see Nesterov (2005) for details).

More concretely, let \(D_{n} = \{z_{1}=(g_{1},y_{1}),...,z_{n}=(g_{n},y_{n})\}\), where \(g_{i}\in {\mathcal {G}}\) and \(y_{i}\in {\mathcal {Y}}=\{-1,1\}\) for \(i=1,...,n\), be the training set. We formulate a constrained minimization problem for learning subtree pattern weights as follows:

$$\begin{aligned}&\begin{array}{c} \hbox {minimize}\\ {\mathbf{W }} \end{array}&{\frac{1}{n^{2}}\sum _{i=1}^{n}\sum _{j=1}^{n}\ell (\mathbf{W }, z_{i}, z_{j})}\nonumber \\&\hbox {subject to }&{\mathbf{W }\in {\mathcal {C}}} \end{aligned}$$
(13)

where \({\mathcal {C}}=\{\mathbf{W }\in {\mathbb {R}}^{d}:\mathbf{W }=\left[ \mathbf{w }_{1}^{T},...,\mathbf{w }_{H}^{T} \right] ^{T}, ||\mathbf{w }_{h}-\mathbf{c }_{h}||_{2}\le \epsilon _{h}, 1\le h \le H \}\) (\(\mathbf{c }_{h}\) are constant vectors and \(\epsilon _{h}\) constant scalars); \(\ell\) is a continuously differentiable function defined as: \(\ell (\mathbf{W }, z_{i}, z_{j})={\left\{ \begin{array}{ll} V_{1}(\mathbf{W }, g_{i}, g_{j}) &{} \text { if } y_{i}\ne y_{j}\\ V_{2}(\mathbf{W }, g_{i}, g_{j}) &{} \text { otherwise} \end{array}\right. }\)

$$\begin{aligned}&V_{1}(\mathbf{W }, g_{i}, g_{j})\nonumber \\&\quad = {\left\{ \begin{array}{ll} 0 &{} \text { if } d_{\mathbf{W }}(g_{i}, g_{j}) \ge \alpha _{1}\\ \alpha _{1} - \frac{\sigma }{2} - d_{\mathbf{W }}(g_{i}, g_{j})&{} \text { if } d_{\mathbf{W }}(g_{i}, g_{j}) \le \alpha _{1}-\sigma \\ \frac{1}{2\sigma }\left( d_{\mathbf{W }}(g_{i}, g_{j}) - \alpha _{1} \right) ^{2} &{} \text { if } \alpha _{1} - \sigma< d_{\mathbf{W }}(g_{i}, g_{j}) < \alpha _{1} \end{array}\right. } \end{aligned}$$
(14)
$$\begin{aligned}&V_{2}(\mathbf{W }, g_{i}, g_{j})\nonumber \\&\quad = {\left\{ \begin{array}{ll} 0 &{} \text { if } d_{\mathbf{W }}(g_{i}, g_{j}) \le \alpha _{2}\\ d_{\mathbf{W }}(g_{i}, g_{j}) - \alpha _{2} - \frac{\sigma }{2}&{} \text { if } d_{\mathbf{W }}(g_{i}, g_{j}) \ge \alpha _{2}+\sigma \\ \frac{1}{2\sigma }\left( d_{\mathbf{W }}(g_{i}, g_{j}) - \alpha _{2} \right) ^{2} &{} \text { if } \alpha _{2}< d_{\mathbf{W }}(g_{i}, g_{j}) < \alpha _{2}+\sigma \end{array}\right. } \end{aligned}$$
(15)

and \(\alpha _{1}\), \(\alpha _{2}\) \(\left( \alpha _{1} \ge \alpha _{2} \right)\) and \(\sigma\) are constants; \(d_{\mathbf{W }}(g_{i}, g_{j})\) is calculated from \(\mathbf{Z }\left( g_{i}, g_{j}\right)\) as in Eq. (12). To reduce notation, we write \(\mathbf{Z }_{ij}\) rather than \(\mathbf{Z }\left( g_{i}, g_{j}\right)\) in the rest of the paper.
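For reference, the smoothed losses (14)–(15) and the pairwise loss \(\ell\) in (13) can be written as follows; this is a direct transcription of the definitions above, with d standing for \(d_{\mathbf{W }}(g_{i}, g_{j})\).

```python
def smooth_hinge_V1(d, alpha1, sigma):
    # Eq. (14): smoothed version of max(0, alpha1 - d)
    if d >= alpha1:
        return 0.0
    if d <= alpha1 - sigma:
        return alpha1 - sigma / 2.0 - d
    return (d - alpha1) ** 2 / (2.0 * sigma)

def smooth_hinge_V2(d, alpha2, sigma):
    # Eq. (15): smoothed version of max(0, d - alpha2)
    if d <= alpha2:
        return 0.0
    if d >= alpha2 + sigma:
        return d - alpha2 - sigma / 2.0
    return (d - alpha2) ** 2 / (2.0 * sigma)

def pair_loss(d, y_i, y_j, alpha1, alpha2, sigma):
    # different labels: push apart beyond alpha1; same label: pull within alpha2
    if y_i != y_j:
        return smooth_hinge_V1(d, alpha1, sigma)
    return smooth_hinge_V2(d, alpha2, sigma)
```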

Lemma 3

Let \(\beta =\max _{1\le i<j\le n} \left\| \mathbf{Z }_{ij}\right\| _{2}\), \(L=\frac{\beta ^{2}}{\sigma }\), and \(M=\max (b-\frac{\sigma }{2}-\alpha _{2}, \alpha _{1}-\frac{\sigma }{2})\). The loss function \(\ell\) defined in problem (13) is L-Lipschitz, \(\beta\)-smooth and upper bounded by M.

Proof

First, since \(0\le d_{\mathbf{W }}\le b\), we have the bounds \(0\le V_{1}\le \alpha _{1}-\frac{\sigma }{2}\) and \(0\le V_{2}\le b-\frac{\sigma }{2}-\alpha _{2}\). Therefore, \(\ell\) is upper bounded by M.

We derive the first and second order derivatives of the smooth function \(V_{1}\) as follows:

$$\begin{aligned}&\nabla V_{1}(\mathbf{W }, g_{i}, g_{j})\nonumber \\&\quad = {\left\{ \begin{array}{ll} 0 &{} \text { if } d_{\mathbf{W }}(g_{i}, g_{j}) \ge \alpha _{1}\\ \mathbf{Z }_{ij}&{} \text { if } d_{\mathbf{W }}(g_{i}, g_{j}) \le \alpha _{1}-\sigma \\ \frac{1}{\sigma }\left( \alpha _{1}- d_{\mathbf{W }}(g_{i}, g_{j}) \right) \mathbf{Z }_{ij} &{} \text { if } \alpha _{1} - \sigma< d_{\mathbf{W }}(g_{i}, g_{j}) < \alpha _{1} \end{array}\right. } \end{aligned}$$
(16)
$$\begin{aligned}&\nabla ^{2} V_{1}(\mathbf{W }, g_{i}, g_{j})\nonumber \\&\quad = {\left\{ \begin{array}{ll} 0 &{} \text { if } d_{\mathbf{W }}(g_{i}, g_{j}) \ge \alpha _{1}\\ 0&{} \text { if } d_{\mathbf{W }}(g_{i}, g_{j}) \le \alpha _{1}-\sigma \\ \frac{1}{\sigma }\mathbf{Z }_{ij}\mathbf{Z }_{ij}^{T} &{} \text { if } \alpha _{1} - \sigma< d_{\mathbf{W }}(g_{i}, g_{j}) < \alpha _{1} \end{array}\right. } \end{aligned}$$
(17)

To prove that the function \(V_{1}\) is L-Lipschitz and \(\beta\)-smooth, it suffices to bound the norm of its gradient, \(\left\| \nabla V_{1}(\mathbf{W },g_{i}, g_{j})\right\| _{2}\), and the spectral norm (i.e., the largest eigenvalue) of its Hessian, \(||\nabla ^{2} V_{1}(\mathbf{W }, g_{i}, g_{j})||\), for all \(g_{i}, g_{j}\in {\mathcal {G}}\). Indeed, from Eqs. (16) and (17) we have \(\left\| \nabla V_{1}(\mathbf{W },g_{i},g_{j})\right\| _{2}\le \left\| \mathbf{Z }_{ij}\right\| _{2}\) and \(\left\| \nabla ^{2}V_{1}(\mathbf{W }, g_{i}, g_{j})\right\| \le \frac{1}{\sigma }\left\| \mathbf{Z }_{ij}\mathbf{Z }_{ij}^{T}\right\| =\frac{1}{\sigma }\left\| \mathbf{Z }_{ij}\right\| _{2}^{2}\). Since \(\left\| \mathbf{Z }_{ij}\right\| _{2}\le \beta\), both bounds follow. Thus \(V_{1}\) is \(\beta\)-smooth and L-Lipschitz, and the same argument applies to \(V_{2}\). The lemma is proven.□

Fig. 2

Illustration of Hinge loss \(\max (0, \alpha _{1}-x)\) and \(\max (0, x-\alpha _{2})\); and their smooth versions: \(V_{1}\) (left) and \(V_{2}\) (right), respectively, with \(\alpha _{1}=3, \alpha _{2}=4\) and \(\sigma =1.5\)

3.3 A stochastic learning algorithm for constrained optimization

The constrained optimization problem (13) is convex, so its global optimum can be found. Standard methods such as projected gradient descent can be used to solve it. However, for large-scale data sets, solving problem (13), which involves \(n^{2}\) terms with d parameters, might be computationally expensive. For instance, the data set PROTEIN (see Table 1) has \(n^{2}> 10^6\) pairs of examples and a weight vector of size \(d> 10^5\) when the number of WL iterations is \(H=5\). In this subsection, we present an efficient stochastic learning algorithm to deal with this issue.

Algorithm 1: Stochastic learning algorithm for solving the constrained optimization problem (13)

Let \(\mathbf{W }^{(t)}\) denote the weights at iteration t. The weights are initialized by a vector of ones, \(\mathbf{W }^{(0)}=\mathbf{1 }_{d}\), which corresponds to the WWL kernel without learned subtree pattern importance. At each iteration t, we randomly pick a pair of examples (\(z = (g,y),z^\prime =(g^\prime ,y^\prime )\)) from the training data set \(D_{n}\) and compute the gradient \(\mathrm {grad}^{(t)}\) corresponding to this pair. This step can be done efficiently due to the sparsity of the feature vector \(\mathbf{Z }\left( g,g^\prime \right)\) in Eq. (12). Then, we update the current solution \(\mathbf{W }^{(t)}\) to \(\mathbf{W }^{(t+1)}\) by the following rule:

$$\begin{aligned} \mathbf{W }^{(t+1)}=\mathrm {proj}_{{\mathcal {C}}}[\mathbf{W }^{(t)}-\mu \mathrm {grad}^{(t)}] \end{aligned}$$

where \(\mathrm {proj}_{{\mathcal {C}}}[\mathbf{W }]_{h}= {\left\{ \begin{array}{ll} \mathbf{w }_{h} &{} \text {if } ||\mathbf{w }_{h}-\mathbf{c }_{h}||_{2}\le \epsilon _{h}\\ \frac{\epsilon _{h}}{||\mathbf{w }_{h}-\mathbf{c }_{h}||_{2}}(\mathbf{w }_{h}-\mathbf{c }_{h})+\mathbf{c }_{h} &{} \text {otherwise} \end{array}\right. }\) maps each block \(\mathbf{w }_{h}\) (\(1\le h \le H\)) back to the bounded feasible region. The procedure is illustrated in Algorithm 1.
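A minimal sketch of Algorithm 1 is given below. The callable Z_pairs(i, j), which returns the feature vector \(\mathbf{Z }_{ij}\), and the block structure (centers \(\mathbf{c }_{h}\), radii \(\epsilon _{h}\)) are our own assumptions; the gradients follow Eq. (16) and its analogue for \(V_{2}\), and the projection implements \(\mathrm {proj}_{{\mathcal {C}}}\) above. This mirrors the description in the text, not the authors' exact code.

```python
import numpy as np

def project_onto_C(W, centers, radii, block_sizes):
    """Project each block w_h onto the ball ||w_h - c_h||_2 <= eps_h."""
    out, start = W.copy(), 0
    for c_h, eps_h, size in zip(centers, radii, block_sizes):
        w_h = out[start:start + size]
        diff = w_h - c_h
        norm = np.linalg.norm(diff)
        if norm > eps_h:
            out[start:start + size] = c_h + (eps_h / norm) * diff
        start += size
    return out

def stochastic_learning(Z_pairs, labels, d, b, alpha1, alpha2, sigma,
                        centers, radii, block_sizes, mu=1e-4, T=500, seed=0):
    rng = np.random.default_rng(seed)
    W = np.ones(d)                               # W^(0) = 1_d (unweighted WWL)
    n = len(labels)
    for _ in range(T):
        i, j = rng.integers(n), rng.integers(n)  # random pair of training graphs
        Z_ij = Z_pairs(i, j)
        d_W = b - W @ Z_ij                       # parametric distance, Eq. (12)
        if labels[i] != labels[j]:               # gradient of V1, Eq. (16)
            if d_W >= alpha1:
                grad = np.zeros(d)
            elif d_W <= alpha1 - sigma:
                grad = Z_ij
            else:
                grad = (alpha1 - d_W) / sigma * Z_ij
        else:                                    # gradient of V2 (analogous)
            if d_W <= alpha2:
                grad = np.zeros(d)
            elif d_W >= alpha2 + sigma:
                grad = -Z_ij
            else:
                grad = -(d_W - alpha2) / sigma * Z_ij
        W = project_onto_C(W - mu * grad, centers, radii, block_sizes)
    return W
```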

4 Theoretical guarantees: a bound on generalization gap

In this section, we provide a bound on the generalization gap of the proposed stochastic learning algorithm for solving the problem (13). The gap is defined as the expected difference between the generalization error \(\textit{R}(.)\) and empirical error \(\textit{R}_{D_{n}}(\cdot )\). In order to derive the generalization bound, we first provide basic setup and notations; then prove that our learning algorithm has a uniform stability, which is established in Theorem 5 using Lemma 3 and Lemma 4; finally derive our generalization bound, which is established in Theorem 9 using the McDiarmid inequality (see Theorem 6).

4.1 Basic setup and notations

Generalization Error. Let \(\mathbf{W }_{n}\) be the parameters of the parametric function (12) obtained by training on the data set \(D_{n}\) using Algorithm 1. Then the generalization error (or risk) \(\textit{R}(\mathbf{W }_{n})\) with a loss function \(\ell\) is defined as:

$$\begin{aligned} \textit{R}(\mathbf{W }_{n})=\mathbf{E }_{z,z^\prime }[\ell (\mathbf{W }_{n}, z, z^\prime )] \end{aligned}$$

where \(\mathbf{E }_{z,z^\prime }\left[ \ell (\cdot ,\cdot ,\cdot )\right]\) denotes the expectation of \(\ell\) when z and \(z^\prime\) are sampled according to the distribution \({\mathcal {P}}\).

Empirical Error. The empirical error \(\textit{R}_{D_{n}}(\mathbf{W }_{n})\) is defined on the training data set \(D_{n}\) as :

$$\begin{aligned} \textit{R}_{D_{n}}(\mathbf{W }_{n})=\frac{1}{n^{2}}\sum _{z_{i}\in D_{n}}\sum _{z_{j}\in D_{n}}\ell (\mathbf{W }_{n}, z_{i},z_{j}) \end{aligned}$$

Expected Generalization Gap. As Algorithm 1 is based on a randomized procedure, we use the definition of the expected generalization gap as follows:

$$\begin{aligned} {\mathbb {K}}_{n}= \mathbf{E }_{\text {SGD}}[\textit{R}(\mathbf{W }_{n})-\textit{R}_{D_{n}}(\mathbf{W }_{n})] \end{aligned}$$

where \(\mathbf{E }_{\text {SGD}}\) denotes the expectation taken over the inherent randomness of the stochastic algorithm.

4.2 Uniform stability of the stochastic learning algorithm

Intuitively, a learning algorithm is said to have a uniform stability if its output is stable under a small modification of the training data set. For a randomized learning algorithm, the uniform stability property is defined as follows:

Definition 2

(Uniform Stability of the randomized algorithm) A randomized algorithm \({\mathbb {A}}\) is \(\beta _{n}\)-uniformly stable with respect to a loss function \(\ell\), if the following inequality holds:

$$\begin{aligned} \forall (D_{n}, k)\text {, } \sup _{z, z^\prime }|\mathbf{E }_{\text {SGD}}[\ell (\mathbf{W }_{n}, z, z^\prime )]-\mathbf{E }_{\text {SGD}}[\ell (\mathbf{W }_{n,k}, z, z^\prime )]|\le \beta _{n} \end{aligned}$$

where \(D_{n,k}\) is the new data set obtained from \(D_{n}\) by replacing \(z_{k}\in D_{n}\) with a new example \({\hat{z}}_{k}\) sampled from \({\mathcal {P}}\); \(\mathbf{W }_{n}\) and \(\mathbf{W }_{n,k}\) are the outputs of \({\mathbb {A}}\) trained on two data sets \(D_{n}\) and \(D_{n,k}\), respectively.

In order to prove that Algorithm 1 has the uniform stability property, we need the following lemma (its proof is given in the Appendix):

Lemma 4

Let the loss function \(\ell\) defined in the problem (13) be \(\beta\)-smooth and L-Lipschitz; \(\mathbf{W }_{n}^{(T)}\) and \(\mathbf{W }_{n,k}^{(T)}\) be the parameters of the parametric form (12) trained on \(D_{n}\) and \(D_{n,k}\), respectively, using Algorithm 1 with the number of iterations T and learning rate \(\mu\). Then, the expected difference in the model parameters is upper bounded by:

$$\begin{aligned} \mathbf{E }_{\text {SGD}}\left[ \left\| \mathbf{W }_{n}^{(T)}-\mathbf{W }_{n,k}^{(T)}\right\| _{2}\right] \le \frac{4}{n}\mu T L \end{aligned}$$
(18)

Using Lemma 4 and the L-Lipschitz property of \(\ell\) (see Lemma 3), we can now prove the stability of Algorithm 1.

Theorem 5

[Uniform Stability of Algorithm 1] Let the loss function \(\ell\) defined in the problem (13) be \(\beta\)-smooth and L-Lipschitz. Then Algorithm 1 with the fixed learning rate \(\mu\) is \(k_{n}\)-uniformly stable where \(k_{n}=\frac{4}{n} \mu T L^{2}\).

Proof

We have the following inequalities:

$$\begin{aligned} |\mathbf{E }_{\text {SGD}}[\ell (\mathbf{W }_{n}, z, z^\prime )]-\mathbf{E }_{\text {SGD}}[\ell (\mathbf{W }_{n,k}, z, z^\prime )]|\le L\mathbf{E }_{\text {SGD}}\left\| \mathbf{W }_{n}-\mathbf{W }_{n,k}\right\| _{2} \le \frac{4}{n} \mu T L^{2} \end{aligned}$$
(19)

where the first and second inequalities are obtained by the L-Lipschitz property of \(\ell\) and Lemma 4, respectively. This completes the proof. □

4.3 Bound on generalization gap

Using the uniform stability property from the previous subsection, we can derive the generalization bound via the McDiarmid inequality (McDiarmid 1989).

Theorem 6

[McDiarmid inequality (McDiarmid 1989)] Let \(X_{1},X_{2},...,X_{n}\) be n independent random variables taking values in \({\mathcal {X}}\) and let \(Z=f(X_{1},...,X_{n})\). If, for each \(1\le i\le n\), there exists a constant \(c_{i}\) such that

\(\sup _{x_{1},...,x_{n},x_{i}^\prime }|f(x_{1},...,x_{n})-f(x_{1},...,x_{i}^\prime ,...,x_{n})|\le c_{i}\text {, }\forall 1\le i\le n\text {,}\)

then for any \(\epsilon >0\), \(\text {Pr}[|Z-\mathbf{E }[Z]|\ge \epsilon ]\le 2\mathrm {exp}\left( \frac{-2\epsilon ^{2}}{\sum _{i=1}^{n}c_{i}^{2}}\right)\)

To derive the bound on \(R(\mathbf{W }_{n})\), we replace Z by \({\mathbb {K}}_{n}\) in Theorem 6 and bound \(\mathbf{E }_{D_{n}}\left[ {\mathbb {K}}_{n}\right]\) and \(|{\mathbb {K}}_{n}-{\mathbb {K}}_{n,k}|\), which is done by the following lemmas (see their proofs in the Appendix).

Lemma 7

If the algorithm is \(k_{n}\)-uniformly stable with respect to the loss function, then the following inequality holds:

$$\begin{aligned} \mathbf{E }_{D_{n}}[{\mathbb {K}}_{n}]\le 2k_{n} \end{aligned}$$
(20)

Lemma 8

If the algorithm is \(k_{n}\)-uniformly stable with respect to the loss function and the loss is upper bounded by M, then the following inequality holds:

$$\begin{aligned} |{\mathbb {K}}_{n}-{\mathbb {K}}_{n,k}|\le 2 k_{n}+\frac{2M}{n} \end{aligned}$$
(21)

Now we can derive the generalization bound for \(R(\mathbf{W }_{n})\) in the following theorem:

Theorem 9

Let \(D_{n}\) be a training data set with n samples and \(\mathbf{W }_{n}\) be the solution obtained by minimizing problem (13) using Algorithm 1 with uniform stability \(k_{n}\). Then the following inequality holds with probability at least \(1-\delta\) \(\left( 0< \delta \le 1\right)\):

$$\begin{aligned} \mathbf{E }_{\text {SGD}}\left[ R(\mathbf{W }_{n})-\textit{R}_{D_{n}}(\mathbf{W }_{n}) \right] \le 2 k_{n} + (n k_{n} + M)\sqrt{\frac{2}{n}\mathrm {log}\frac{2}{\delta }} \end{aligned}$$
(22)

Proof

Applying McDiarmid’s concentration inequality (Theorem 6) with Z replaced by \({\mathbb {K}}_{n}\) and \(c_{i}=2 k_{n}+\frac{2M}{n}\) (Lemma 8), we have:

$$\begin{aligned} Pr({\mathbb {K}}_{n} - \mathbf{E }_{D_{n}}\left[ {\mathbb {K}}_{n}\right] \ge \epsilon )\le 2\mathrm {exp}\left( \frac{-2\epsilon ^{2}}{n\left( 2 k_{n}+\frac{2M}{n}\right) ^{2}}\right) \end{aligned}$$

Setting \(\delta =2\mathrm {exp}\left( \frac{-2\epsilon ^{2}}{n\left( 2 k_{n}+\frac{2M}{n}\right) ^{2}}\right)\) and solving for \(\epsilon\) gives \(\epsilon =(n k_{n} + M)\sqrt{\frac{2}{n}\mathrm {log}\frac{2}{\delta }}\). Combining this with Lemma 7, which gives \(\mathbf{E }_{D_{n}}\left[ {\mathbb {K}}_{n}\right] \le 2k_{n}\), completes the proof of Theorem 9. □

The generalization bound is meaningful if it converges to 0 as \(n\rightarrow \infty\). Since \(k_{n}\) decays as \(O(\frac{1}{n})\) for fixed \(\mu\) and T, our derived bound indeed converges to 0, confirming that Algorithm 1 generalizes.
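For concreteness, substituting the uniform stability constant \(k_{n}=\frac{4}{n}\mu T L^{2}\) from Theorem 5 into (22) gives

$$\begin{aligned} \mathbf{E }_{\text {SGD}}\left[ R(\mathbf{W }_{n})-\textit{R}_{D_{n}}(\mathbf{W }_{n}) \right] \le \frac{8\mu T L^{2}}{n} + \left( 4\mu T L^{2} + M\right) \sqrt{\frac{2}{n}\mathrm {log}\frac{2}{\delta }}, \end{aligned}$$

whose right-hand side decays as \(O(1/\sqrt{n})\) for fixed \(\mu\) and T.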

5 Experiments

In this section, we demonstrate the benefit of learning subtree pattern weights through experiments on both synthetic and real-world data. We performed classification experiments using the C-SVM implementation LIBSVM (Chang and Lin 2011). The necessary parameters of the SVM were selected by cross-validation on the training set: the regularization parameter \(C \in \{10^{-3},10^{-2},..., 10^{2},10^{3}\}\) and the kernel parameter \(\gamma \in \{0.0001, 0.001,0.01\}\). For learning the weights \(\mathbf{W }\) of subtree patterns (or WL labels) in Algorithm 1, the learning rate \(\mu\) and the maximum number of iterations T were set to 0.0001 and 500, respectively; \(\epsilon _{h}\in \{0.1, 0.5, 1.0\}\) was selected by cross-validation on the training set and \(\mathbf{c }_{h}\) was fixed as a vector of ones, i.e. \(\mathbf{c }_{h}=\mathbf{1 }_{|\Sigma ^{h}|}\) for \(h=1,...,H\); the hyperparameters \(\alpha _{1}\), \(\alpha _{2}\) and \(\sigma\) were empirically set to 1.0, 0.5 and 0.1, respectively. All kernels were implemented in Python 3 and experiments were conducted on an Intel Core i9 at 2.3 GHz with 64GB of RAM using a single processor. The source code is available at https://github.com/haidnguyen0909/weightedWWL.
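For reference, the following sketch shows how such an evaluation pipeline can be assembled with scikit-learn, assuming a precomputed pairwise distance matrix D_train (e.g. from the parametric distance above); it mirrors the grids reported in the text but is not the authors' exact code.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def evaluate(D_train, y_train, gammas=(1e-4, 1e-3, 1e-2),
             Cs=(1e-3, 1e-2, 1e-1, 1, 10, 100, 1000)):
    """Convert a distance matrix to a Laplacian kernel (Eq. (7)) and tune an SVM."""
    best, best_score = None, -np.inf
    for gamma in gammas:
        K_train = np.exp(-gamma * D_train)       # Laplacian kernel, Eq. (7)
        clf = GridSearchCV(SVC(kernel="precomputed"), {"C": list(Cs)}, cv=5)
        clf.fit(K_train, y_train)
        if clf.best_score_ > best_score:
            best, best_score = (gamma, clf.best_params_["C"]), clf.best_score_
    return best, best_score
```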

5.1 Synthetic data

We designed eight substructures, shown in Fig. 3, in which substructures 1, 2, 5 and 6 are assumed to be indicative of the positive class (+1), as they share the pattern ’1–0(−2)–0’. The others are indicative of the negative class (−1). Our synthetic data set consists of eight groups of graphs, each generated from one of these eight substructures by randomly adding noisy nodes and edges. We used the groups corresponding to substructures 1, 2, 3 and 4 as training data and the others as testing data. We constructed two kernels for the graphs, WWL and the proposed method, with the number of WL iterations \(H=2\), and then used an SVM for classification. We report the mean accuracy over ten synthetic data sets generated in this way.

We observed that WWL obtained a mean accuracy of 82.4%, while the proposed method achieved a significantly higher accuracy of 95%. Note that some testing examples confuse the classifier. For instance, according to the WWL kernel, substructure 5 is equally similar to substructure 2 (indicative of the positive class) and substructure 4 (indicative of the negative class), making it hard for the classifier to distinguish graphs containing substructure 5. In contrast, this confusion can be alleviated by learning subtree pattern weights. In particular, the pattern ’1–0(−2)–0’ present in substructure 5 is assigned a high weight by the proposed method (see Fig. 4). Therefore, graphs generated from substructure 5 are more likely to be classified as positive.

Fig. 3

Designed substructures 1–8: substructures 1, 2, 5 and 6 are indicative of the positive class as they contain the pattern ’1–0(−2)–0’ (highlighted in yellow); the rest are indicative of the negative class. Graphs are generated from the substructures by adding random nodes and noisy edges (20 examples for each substructure). Graphs generated from substructures 1, 2, 3 and 4 are used for training; graphs generated from substructures 5, 6, 7 and 8 are used for testing

Fig. 4

Examples of selected subtree patterns of the first level \(h=1\) (left) and their weights learned by the proposed learning algorithm (right)

5.2 Real-world data

In this subsection, we present an experimental evaluation of the proposed method on real-world data. We report results on four benchmark bioinformatics data sets of node-labeled graphs: MUTAG, PTC-MR, PROTEIN and NCI1. The MUTAG dataset consists of graph structures of 188 chemical compounds which are either mutagenic aromatic or heteroaromatic nitro compounds; nodes can have 7 discrete labels. The PTC-MR dataset consists of 344 chemical compounds which are known to cause or not cause cancer in rats and mice; nodes can have 19 discrete labels. The PROTEIN dataset consists of graphs in which nodes represent secondary structure elements and edges encode neighborhood in the amino-acid sequence or in 3D space; nodes can have 3 discrete labels. The NCI1 dataset is a balanced subset of chemical compounds screened for their ability to inhibit the growth of a panel of human tumor cell lines; nodes can have 37 discrete labels. Statistics of these data sets are shown in Table 1.

We compared the proposed method to several state-of-the-art graph kernels. Due to the large number of graph kernels in the literature, we selected representatives of the major families. For the family of walk-based kernels, we compared to the fast random walk kernel (Kashima 2003), which essentially counts common labeled walks. For the family of path-based kernels, we compared to the shortest path kernel (Borgwardt and Kriegel 2005). For the family of WL-based kernels, we compared to the WL subtree (Vishwanathan et al. 2010), WL-OA (Kriege et al. 2016) and WWL (Togninalli et al. 2019) kernels. The WL-based kernels have been shown to be superior to previous approaches.

Table 1 Statistics of datasets used in experiments
Fig. 5

Examples of selected subtree patterns and their weights of the first level \(h=1\) a and the second level \(h=2\) b extracted from MUTAG by Algorithm 1

Fig. 6

Examples of selected subtree patterns and their weights of the first level \(h=1\) a and the second level \(h=2\) b extracted from PTC-MR by Algorithm 1

We report mean predictive accuracies and standard deviations obtained by 10-fold cross-validation repeated 10 times with random fold assignments. Within each fold, the number of hops \(H\in \{1,2,...,6\}\) was selected by cross-validation on the training set. The classification accuracies are summarised in Table 2. We used a one-sided paired t-test to verify whether the accuracy differences between two methods are statistically significant. We observed that the random walk and shortest path kernels were less competitive than the WL-based kernels on all four data sets. On MUTAG, PROTEIN and NCI1, the proposed method was comparable with WL-OA, while it improved on its unweighted version WWL by 1.4%, 1.5% and 0.8%, respectively (the p values were 0.061, 0.0087 and 0.055, respectively, all smaller than the significance level of \(\alpha =0.1\)). On PTC-MR, the proposed method improved on WWL by 0.6% and outperformed WL-OA by nearly 5% (p values of 0.0035 and < 0.001, respectively). On all data sets, the random walk, shortest path and WL subtree kernels were dominated by the other methods by large margins. Furthermore, we investigated selected subtree patterns at the first and second levels (\(h=1,2\)) along with their weights learned by the proposed algorithm on MUTAG and PTC-MR (see Figs. 5 and 6, respectively). Interestingly, the learned weights differ across substructures, indicating their different degrees of importance for the prediction task. These results confirm the effectiveness of learning important subtree patterns for WWL kernels.

Table 2 Classification accuracies and standard deviation on real-world graph data sets: MUTAG, PTC-MR, PROTEIN and NCI1

5.3 Computational efficiency of the proposed stochastic algorithm

In this subsection, we evaluate the computational efficiency of the proposed Algorithm 1. We empirically compared two variants for solving the minimization problem (13) in terms of running time: full batch and stochastic (Algorithm 1). The first variant considers all pairs of graphs at every step of projected gradient descent; the second considers one pair of graphs at a time to take a single step.

First, we assessed the running time of the two variants on randomly generated graphs (as described in Sect. 5.1) with respect to two parameters: the number of graphs N and the number of WL iterations H. We varied N in the range \(\{50,100,200,400,600, 800,1000\}\) and H in the range \(\{1,2,3,4,5,6,7\}\). For each individual experiment, we fixed one parameter at its default value and varied the other; the default values were 100 for N and 2 for H. We report CPU running times in seconds in Fig. 7. We observed that the running time of the full batch variant increased quickly when increasing the number of graphs N and the number of WL iterations H. In contrast, the stochastic variant scaled much better with much lower running times, indicating that Algorithm 1 has high scalability for large-scale data sets. The computational efficiency of Algorithm 1 can be explained by the fact that computing the gradient of the loss for a graph pair g and \(g^\prime\) involves only the few substructures (WL labels) shared by g and \(g^\prime\) in Eq. (12), i.e., the feature vector \(\mathbf{Z }\left( g,g^\prime \right)\) is sparse.

Second, we assessed the running time of the two variants on the real-world data sets MUTAG, PTC-MR, PROTEIN and NCI1. Table 3 reports the running time of the two variants for the entire classification task, including learning subtree pattern weights, computing kernels and performing classification; running times were averaged over 10-fold cross-validation. We observed that the full batch variant was slow even on the small data sets MUTAG and PTC-MR, taking approximately 4 and 9 hours, respectively, whereas the stochastic variant was much faster, taking less than 4 minutes on these two data sets. We also observed that the stochastic variant easily scales to data sets with thousands of graphs: on PROTEIN and NCI1, the tasks were performed in nearly 1.5 and 5 hours, respectively, while the full batch variant could not finish them within 3 days. This evidence shows that the proposed stochastic variant is highly scalable.

Fig. 7

Running time in seconds on synthetic data sets of two variants: full batch and stochastic algorithms, for learning the subtree pattern importance W (in the optimization problem (13)) (Default values: dataset size \(N = 100\), WL iteration \(H = 2\))

Table 3 Running time in seconds of two variants: full batch and stochastic on real-world data sets: MUTAG, PTC-MR, PROTEIN and NCI1

6 Conclusion and discussion

In this work, we proposed to learn the weights of graph substructures, in particular subtree patterns extracted by the WL labeling scheme, to overcome the limitations of current graph kernels. We considered the problem of incorporating subtree pattern weights into the WWL kernel (Togninalli et al. 2019) by formulating a parametric form of the Wasserstein distance that takes subtree pattern weights into account and learning these weights from data optimally for the task.

Our proposed method has several advantages. First, it can learn the importance of subtree patterns for the task at hand through their weights in the parametric distance function. Second, the kernels derived from the learned parametric function of subtree pattern weights are valid. Third, the efficient stochastic algorithm for learning the weights scales to large data sets, and theoretical guarantees for it are provided.

Although we considered the WWL kernel for extracting important subtree patterns, an interesting and worthwhile extension of our work would be to apply this idea to other WL-based graph kernels such as the WL subtree and WL-OA kernels. Improving the optimization algorithm for learning subtree pattern weights in terms of both convergence and efficiency is also left for future work.