A unifying view of explicit and implicit feature maps of graph kernels
Abstract
Nonlinear kernel methods can be approximated by fast linear ones using suitable explicit feature maps, allowing their application to large-scale problems. We investigate how convolution kernels for structured data are composed from base kernels and construct corresponding feature maps. On this basis we propose exact and approximative feature maps for widely used graph kernels based on the kernel trick. We analyze for which kernels and graph properties computation by explicit feature maps is feasible and actually more efficient. In particular, we derive approximative, explicit feature maps for state-of-the-art kernels supporting real-valued attributes, including the GraphHopper and graph invariant kernels. In extensive experiments we show that our approaches often achieve a classification accuracy close to that of the exact methods based on the kernel trick, but require only a fraction of their running time. Moreover, we propose and analyze algorithms for computing random walk, shortest-path and subgraph matching kernels by explicit and implicit feature maps. Our theoretical results are confirmed experimentally by observing a phase transition when comparing running time with respect to label diversity, walk lengths and subgraph size, respectively.
Keywords
Graph kernels · Feature maps · Random walk kernel · Structured data · Supervised learning

1 Introduction
Analyzing complex data is becoming more and more important. In numerous application domains, e.g., chem- and bioinformatics, neuroscience, or image and social network analysis, the data is structured and hence can naturally be represented as graphs. To achieve successful learning we need to exploit the rich information inherent in the graph structure and the annotations of vertices and edges. A popular approach to mining structured data is to design graph kernels measuring the similarity between pairs of graphs. The graph kernel can then be plugged into a kernel machine, such as a support vector machine or Gaussian process, for efficient learning and prediction.
The kernel-based approach to predictive graph mining requires a positive semidefinite (p.s.d.) kernel function between graphs. Graphs, composed of labeled vertices and edges, possibly enriched with continuous attributes, however, are not fixed-length vectors but rather complicated data structures, and thus standard kernels cannot be used. Instead, the general strategy to design graph kernels is to decompose graphs into small substructures among which kernels are defined following the concept of convolution kernels due to Haussler (1999). The graph kernel itself is then a combination of the kernels between the possibly overlapping parts. Hence the various graph kernels proposed in the literature mainly differ in the way the parts are constructed and in the similarity measure used to compare them. Moreover, existing graph kernels differ in their ability to exploit annotations, which may be categorical labels or real-valued attributes on the vertices and edges. Graph kernels can be computed following one of two strategies.
 (i)
One way is functional computation, e.g., from closed-form expressions. In this case the feature map is not necessarily known and the feature space may be of infinite dimension. Therefore, we refer to this approach, based on the famous kernel trick, as implicit computation.
 (ii)
The other strategy is to compute the feature map \(\phi (G)\) for each graph G explicitly to obtain the kernel values from the dot product between pairs of feature vectors. These feature vectors commonly count how often certain substructures occur in a graph.
Previously proposed graph kernels that are computed implicitly typically support specifying arbitrary kernels for vertex annotations, but do not scale to large graphs and data sets. Even when approximative explicit feature maps of the kernel on vertex annotations are known, it is not clear how to obtain (approximative) feature maps for the graph kernel.
Our contribution We study under which conditions the computation of an explicit mapping from graphs to a finite-dimensional feature space is feasible and efficient. To achieve our goal, we discuss feature maps corresponding to closure properties of kernels and general convolution kernels with a focus on the size and sparsity of their feature vectors. Our theoretical analysis identifies a trade-off between running time and flexibility.
Building on the systematic construction of feature maps we obtain new algorithms for explicit graph kernel computation, which allow us to incorporate (approximative) explicit feature maps of kernels on vertex annotations. Thereby, known approximation results for kernels on continuous data are lifted to kernels for graphs with continuous annotations. More precisely, we introduce the class of weighted vertex kernels and show that it generalizes state-of-the-art kernels for graphs with continuous attributes, namely the GraphHopper kernel (Feragen et al. 2013) and an instance of the graph invariant kernels (Orsini et al. 2015). We derive explicit feature maps with approximation guarantees based on approximative feature maps of the base kernels to compare annotations. Then, we propose and analyze algorithms for computing fixed length walk kernels by explicit and implicit feature maps. We investigate shortest-path kernels (Borgwardt and Kriegel 2005) and subgraph matching kernels (Kriege and Mutzel 2012) and put the related work into the context of our systematic study. Given this, we are finally able to experimentally compare the running times of both computation strategies systematically with respect to label diversity, data set size, and substructure size, i.e., walk length and subgraph size. As it turns out, there exists a computational phase transition for walk and subgraph kernels. Our experimental results for weighted vertex kernels show that their computation by explicit feature maps is feasible and provides a viable alternative even when comparing graphs with continuous attributes.

Feature maps of composed kernels We review closure properties of kernels, the corresponding feature maps and the size and sparsity of the feature vectors. Based on this, we obtain explicit feature maps for convolution kernels with arbitrary base kernels. This generalizes the result of the conference paper, where binary base kernels were considered.

Weighted vertex kernels We introduce weighted vertex kernels for attributed graphs, which generalize the GraphHopper kernel (Feragen et al. 2013) and graph invariant kernels (Orsini et al. 2015). Weighted vertex kernels were not considered in the conference paper.

Construction of explicit feature maps We derive explicit feature maps for weighted vertex kernels and the shortest-path kernel (Borgwardt and Kriegel 2005) supporting base kernels with explicit feature maps for the comparison of attributes. We prove approximation guarantees in case of approximative feature maps of base kernels. This contribution is not contained in the conference paper, where only the explicit computation of the shortest-path kernel for graphs with discrete labels was discussed.

Fixed length walk kernels We generalize the explicit computation scheme to support arbitrary vertex and edge kernels with explicit feature maps for the comparison of attributes. In the conference paper only binary kernels were considered. Moreover, we have significantly expanded the section on walk kernels by spelling out all proofs, adding illustrative figures and clarifying the relation to the k-step random walk kernel as defined by Sugiyama and Borgwardt (2015).

Experimental evaluation We largely extended our evaluation, which now includes experiments for the novel computation schemes of graph kernels as well as a comparison between a graphlet kernel and the subgraph matching kernel (Kriege and Mutzel 2012).
2 Related work
In the following we review existing kernels based on explicit or implicit computation and discuss embedding techniques for attributed graphs. We focus on the approaches most relevant for our work and refer the reader to the survey articles (Vishwanathan et al. 2010; Ghosh et al. 2018; Zhang et al. 2018b; Kriege 2019) for a more comprehensive overview.
2.1 Graph kernels
Most graph kernels decompose graphs into substructures and count their occurrences to obtain a feature vector. The kernel function then counts the co-occurrences of features in two graphs by taking the dot product between their feature vectors. The graphlet kernel, for example, counts induced subgraphs of size \(k \in \{3,4,5\}\) of unlabeled graphs according to \(K(G,H) = {\mathbf {f}}_{G}^\top {\mathbf {f}}_{H}\), where \({\mathbf {f}}_{G}\) and \({\mathbf {f}}_{H}\) are the subgraph feature vectors of G and H, respectively (Shervashidze et al. 2009). The cyclic pattern kernel is based on cycles and trees and maps the graphs to substructure indicator features, which are independent of the substructure frequency (Horváth et al. 2004). The Weisfeiler-Lehman subtree kernel counts label-based subtree patterns according to \(K_h(G, H) = \sum _{i = 1}^{h} K(G_i, H_i)\), where \(K(G_i, H_i) = \langle {\mathbf {f}}^{(i)}_{G}, {\mathbf {f}}^{(i)}_{H} \rangle \) and \({\mathbf {f}}^{(i)}_{G}\) is a feature vector counting subtree patterns in G of depth i (Shervashidze et al. 2011; Shervashidze and Borgwardt 2009). A subtree pattern is a tree rooted at a particular vertex where each level contains the neighbors of its parent vertex; the same vertices can appear repeatedly. Other graph kernels on subtree patterns have been proposed in the literature, e.g., Ramon and Gärtner (2003), Harchaoui and Bach (2007), Bai et al. (2015) and Hido and Kashima (2009). In a similar spirit, the propagation kernel iteratively counts similar label or attribute distributions to create an explicit feature map for efficient kernel computation (Neumann et al. 2016). Martino et al. (2012) proposed to decompose graphs into multisets of ordered directed acyclic graphs, which are compared by extended tree kernels. While convolution kernels decompose graphs into their parts and sum over all pairs, assignment kernels are obtained from an optimal bijection between parts (Fröhlich et al. 2005).
Since this does not lead to valid kernels in general (Vert 2008; Vishwanathan et al. 2010), various approaches to overcome this obstacle have been developed (Johansson and Dubhashi 2015; Schiavinato et al. 2015; Kriege et al. 2016; Nikolentzos et al. 2017). Several kernels have been proposed with the goal of taking graph structure at different scales into account, e.g., using k-core decomposition (Nikolentzos et al. 2018) or spectral properties (Kondor and Pan 2016). Yanardag and Vishwanathan (2015) combine neural techniques from language modeling with state-of-the-art graph kernels in order to incorporate similarities between the individual substructures. Such similarities were specifically designed for the substructures used by the graphlet and the Weisfeiler-Lehman subtree kernel, among others. Narayanan et al. (2016) discuss several problems of this approach to obtaining substructure similarities and introduce subgraph2vec to overcome these issues.
Many real-world graphs have continuous attributes, such as real-valued vectors, attached to their vertices and edges. For example, the vertices of a molecular graph may be annotated by the physical and chemical properties of the atoms they represent. The kernels based on counting co-occurrences described above, however, consider two substructures as identical if they match exactly, structure-wise as well as attribute-wise, and as completely different otherwise. For attributed graphs it is desirable to compare annotations by more complex similarity measures such as the Gaussian RBF kernel. The kernels discussed in the following allow user-defined kernels for the comparison of vertex and edge attributes. Moreover, they compare graphs in a way that takes the interplay between structure and attributes into account and are therefore suitable for graphs with continuous attributes.
Shortest paths are another substructure used to measure the similarity among graphs. Borgwardt and Kriegel (2005) proposed the shortest-path kernel, which compares two graphs based on vertex pairs with similar shortest-path lengths. The GraphHopper kernel compares the vertices encountered while hopping along shortest paths by a user-specified kernel (Feragen et al. 2013). Similar to the graphlet kernel, the subgraph matching kernel compares subgraphs of small size, but allows scoring mappings between them according to vertex and edge kernels (Kriege and Mutzel 2012). Further kernels designed specifically for graphs with continuous attributes exist (Orsini et al. 2015; Su et al. 2016; Martino et al. 2018).
2.2 Embedding techniques for attributed graphs
Kernels for attributed graphs often allow specifying arbitrary kernels for comparing attributes and are computed using the kernel trick without generating feature vectors. Moreover, several approaches for computing vector representations of attributed graphs have been proposed. These, however, do not allow specifying a function for comparing attributes: the similarity measure that is implicitly used to compare attributes is typically not known. This is the case for recent deep learning approaches as well as for some kernels proposed for attributed graphs.
Deep learning on graphs Recently, a number of approaches to graph classification based upon neural networks have been proposed. Here a vectorial representation for each vertex is learned iteratively from the vertex annotations of its neighbors using a parameterized (differentiable) neighborhood aggregation function. Eventually, the vector representations for the individual vertices are combined to obtain a vector representation for the graph, e.g., by summation.
The parameters of the aggregation function are learned together with the parameters of the classification or regression algorithm, e.g., a neural network. More refined approaches use differentiable pooling operators based on sorting (Zhang et al. 2018a) and soft assignments (Ying et al. 2018b). Most of these neural approaches fit into the framework proposed by Gilmer et al. (2017). Notable instances of this model include neural fingerprints (Duvenaud et al. 2015), GraphSAGE (Hamilton et al. 2017a), and the spectral approaches proposed by Bruna et al. (2014), Defferrard et al. (2016) and Kipf and Welling (2017), all of which descend from early work, see, e.g., Merkwirth and Lengauer (2005) and Scarselli et al. (2009).
These methods show promising results on several graph classification benchmarks, see, e.g., Ying et al. (2018b), as well as in applications such as protein–protein interaction prediction (Fout et al. 2017), recommender systems (Ying et al. 2018a), and the analysis of quantum interactions in molecules (Schütt et al. 2017). A survey of recent advancements can be found in Hamilton et al. (2017b). With these approaches, the vertex attributes are aggregated for each graph and not directly compared between the graphs. Therefore, it is not obvious how the similarity of vertex attributes is measured.
Explicit feature maps of kernels for attributed graphs Graph kernels supporting complex annotations typically use implicit computation schemes and do not scale well, whereas graphs with discrete labels are efficiently compared by graph kernels based on explicit feature maps. Kernels limited to graphs with categorical labels can be applied to attributed graphs by discretization of the continuous attributes, see, e.g., Neumann et al. (2016). Morris et al. (2016) proposed the hash graph kernel framework to obtain efficient kernels for graphs with continuous labels from those proposed for discrete ones. The idea is to iteratively turn continuous attributes into discrete labels using randomized hash functions. A drawback of the approach is that so-called independent k-hash families must be known to guarantee that the approach approximates attribute comparisons by the kernel k. In practice, locality-sensitive hashing is used, which does not provide this guarantee, but still achieves promising results. To the best of our knowledge, no results are known on explicit feature maps of kernels for graphs with continuous attributes that are compared by a well-defined similarity measure such as the Gaussian RBF kernel.
However, explicit feature maps of kernels for vectorial data have been studied extensively. Starting with the seminal work by Rahimi and Recht (2008), explicit feature maps of various popular kernels have been proposed, cf. (Vedaldi and Zisserman 2012; Kar and Karnick 2012; Pham and Pagh 2013, and references therein). In this paper, we build on this line of work to obtain kernels for graphs, where individual vertices and edges are annotated by vectorial data. In contrast to the hash graph kernel framework our goal is to lift the known approximation results for kernels on vectorial data to kernels for graphs annotated with vectorial data.
3 Preliminaries
An (undirected) graph G is a pair (V, E) with a finite set of vertices V and a set of edges \(E \subseteq \{ \{u,v\} \subseteq V \mid u \ne v \}\). We denote the set of vertices and the set of edges of G by V(G) and E(G), respectively. For ease of notation we denote the edge \(\{u,v\}\) in E(G) by uv or vu and the set of all graphs by \(\mathcal {G}\). A graph \(G' = (V',E')\) is a subgraph of a graph \(G=(V,E)\) if \(V' \subseteq V\) and \(E' \subseteq E\). The subgraph \(G'\) is said to be induced if \(E' = \{uv \in E \mid u,v \in V' \}\) and we write \(G' \subseteq G\). We denote the neighborhood of a vertex v in V(G) by \({{\,\mathrm{N}\,}}(v) = \{ u \in V(G) \mid vu \in E(G) \}\).
A labeled graph is a graph G endowed with a label function \(\tau :V(G) \rightarrow \varSigma \), where \(\varSigma \) is a finite alphabet. We say that \(\tau (v)\) is the label of v for v in V(G). An attributed graph is a graph G endowed with a function \(\tau :V(G) \rightarrow \mathbb {R} ^d\), \(d \in \mathbb {N}^{+}\), and we say that \(\tau (v)\) is the attribute of v. We denote the base kernel for comparing vertex labels and attributes by \(k_V\) and, for short, write \(k_V(u,v)\) instead of \(k_V(\tau (u),\tau (v))\). The above definitions directly extend to graphs where edges have labels or attributes, and we denote the base kernel by \(k_E\). We refer to \(k_V\) and \(k_E\) as vertex kernel and edge kernel, respectively, and assume both to take nonnegative values only.
Let \({\mathsf {T}}_k\) be the running time for evaluating a kernel for a pair of graphs, \({\mathsf {T}}_\phi \) the time for computing a feature vector for a single graph and \({\mathsf {T}}_{\text {dot}}\) the time for computing the dot product between two feature vectors. Computing an \(n \times n\) matrix with all pairwise kernel values for n graphs requires (i) time \(\mathcal {O}(n^2 {\mathsf {T}}_k)\) using implicit feature maps, and (ii) time \(\mathcal {O}(n {\mathsf {T}}_{\phi } + n^2 {\mathsf {T}}_{\text {dot}})\) using explicit feature maps. Clearly, explicit computation can only be competitive with implicit computation when the time \({\mathsf {T}}_{\text {dot}}\) is smaller than \({\mathsf {T}}_k\). In this case, however, even a time-consuming feature mapping \({\mathsf {T}}_\phi \) pays off with increasing data set size. The running time \({\mathsf {T}}_{\text {dot}}\) depends on the data structure used to store feature vectors. Since feature vectors for graph kernels often contain many components that are zero, we consider sparse data structures, whose running times depend on the number of nonzero components instead of the total number of components. For a vector v in \(\mathbb {R} ^d\), we denote by \(\mathsf {nz}(v)\) the set of indices of the nonzero components of v and let \(\mathsf {nnz}(v) = |\mathsf {nz}(v)|\) be the number of nonzero components. Using hash tables, the dot product between \(\varPhi _1\) and \(\varPhi _2\) can be realized in time \({\mathsf {T}}_{\text {dot}}=\mathcal {O}(\min \{\mathsf {nnz}(\varPhi _1), \mathsf {nnz}(\varPhi _2)\})\) in the average case.
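The hash-table dot product can be sketched as follows. The substructure identifiers are hypothetical; the point is only that iterating over the vector with fewer nonzero components yields the \(\mathcal {O}(\min \{\mathsf {nnz}(\varPhi _1), \mathsf {nnz}(\varPhi _2)\})\) bound.

```python
def sparse_dot(phi1, phi2):
    """Dot product of sparse feature vectors stored as dicts
    {component index: value}; iterating over the smaller dict gives
    average-case time O(min{nnz(phi1), nnz(phi2)})."""
    if len(phi2) < len(phi1):
        phi1, phi2 = phi2, phi1
    return sum(v * phi2.get(i, 0) for i, v in phi1.items())

# Hypothetical substructure counts of two graphs:
phi_G = {"path-AB": 2, "path-BC": 1}
phi_H = {"path-AB": 1, "path-BC": 3, "path-CD": 1}
print(sparse_dot(phi_G, phi_H))  # 2*1 + 1*3 = 5
```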
4 Basic kernels, composed kernels and their feature maps
Graph kernels, in particular those supporting user-specified kernels for annotations, typically employ closure properties. This allows us to decompose graphs into parts that are eventually the annotated vertices and edges. The graph kernel then is composed of base kernels applied to the annotations and annotated substructures, respectively. We first consider the explicit feature maps of basic kernels, then review closure properties of kernels and discuss how to obtain their explicit feature maps. The results are summarized in Table 1. This forms the basis for the systematic construction of explicit feature maps of graph kernels according to their composition of base kernels later in Sect. 5.
Table 1 Composed kernels, their feature maps, dimension and sparsity

Kernel  Feature map  Dimension  Sparsity

\(k^{\alpha }(x,y) = \alpha k(x,y)\)  \(\phi ^{\alpha }(x) = \sqrt{\alpha }\,\phi (x)\)  \(d\)  \(\mathsf {nnz}(\phi (x))\)
\(k^{+}(x,y) = \sum _{i=1}^{D}k_i(x,y)\)  \(\phi ^{+}(x) = \bigoplus _{i=1}^D \phi _i(x)\)  \(\sum ^D_{i=1} d_i\)  \(\sum _{i=1}^{D} \mathsf {nnz}(\phi _i(x))\)
\(k^{\bullet }(x,y) = \prod _{i=1}^{D}k_i(x,y)\)  \(\phi ^{\bullet }(x) = \bigotimes _{i=1}^D \phi _i(x)\)  \(\prod ^D_{i=1} d_i\)  \(\prod _{i=1}^{D} \mathsf {nnz}(\phi _i(x))\)
\(k^\times (X,Y) = \sum _{x \in X} \sum _{y \in Y} k(x,y)\)  \(\phi ^{\times }(X) = \sum _{x \in X} \phi (x)\)  \(d\)  \(\left|\bigcup _{x \in X} \mathsf {nz}(\phi (x))\right|\)
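The rows of Table 1 can be illustrated with a small NumPy sketch on toy vectors (not tied to any particular graph kernel): scaling multiplies the map by \(\sqrt{\alpha }\), addition concatenates the maps, multiplication takes their tensor product, and the set kernel sums the element maps.

```python
import numpy as np

# Feature maps for the composed kernels of Table 1, sketched on toy
# NumPy vectors.
def scaled(phi, alpha):      # k'(x,y) = alpha*k(x,y)   ->  sqrt(alpha)*phi(x)
    return np.sqrt(alpha) * phi

def summed(*phis):           # k'(x,y) = sum_i k_i(x,y) ->  concatenation
    return np.concatenate(phis)

def multiplied(*phis):       # k'(x,y) = prod_i k_i(x,y) -> tensor product
    out = phis[0]
    for p in phis[1:]:
        out = np.kron(out, p)
    return out

def crossed(phis):           # set (cross product) kernel -> sum of element maps
    return np.sum(phis, axis=0)

# Sanity check of the product rule: the dot product of the tensor
# products equals the product of the individual dot products.
ax, bx = np.array([1.0, 2.0]), np.array([0.0, 3.0])
ay, by = np.array([2.0, 1.0]), np.array([1.0, 1.0])
assert np.isclose(multiplied(ax, bx) @ multiplied(ay, by), (ax @ ay) * (bx @ by))
```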
4.1 Dirac and binary kernels
We discuss feature maps for basic kernels often used for the construction of kernels on structured objects. The Dirac kernel \(k_\delta \) on \(\mathcal {X} \) is defined by \(k_\delta (x,y) = 1\), if \(x=y\) and 0 otherwise. For \(\mathcal {X}\) a finite set, it is wellknown that \(\phi :\mathcal {X} \rightarrow \{0,1\}^{\mathcal {X} }\) with components indexed by \(i \in \mathcal {X} \) and defined as \(\phi (x)_i = 1\) if \(i=x\), and 0 otherwise, is a feature map of the Dirac kernel.
Lemma 1
Let k be a binary kernel on \(\mathcal {X} \), then \(x \sim _k y \Longrightarrow x \sim _k x\) holds for all \(x,y \in \mathcal {X} \).
Proof
Assume there are \(x,y \in \mathcal {X} \) such that \(x \not \sim _k x\) and \(x \sim _k y\). By the definition of \(\sim _k\) we obtain \(k(x,x)=0\) and \(k(x,y)=1\). The symmetric kernel matrix obtained by k for \(X=\{x,y\}\) thus is either \(\bigl ({\begin{matrix} 0&{}1\\ 1&{}0 \end{matrix}} \bigr )\) or \(\bigl ({\begin{matrix} 0&{}1\\ 1&{}1 \end{matrix}} \bigr )\), where we assume that the first row and column is associated with x. Neither matrix is p.s.d., and thus k is not a kernel, contradicting the assumption. \(\square \)
Lemma 2
Let k be a binary kernel on \(\mathcal {X} \), then \(\sim _k\) is a partial equivalence relation meaning that the relation \(\sim _k\) is (i) symmetric, and (ii) transitive.
Proof
Property (i) follows from the fact that k must be symmetric according to the definition. Assume property (ii) does not hold. Then there are \(x,y,z \in \mathcal {X} \) with \(x \sim _k y \wedge y \sim _k z\) and \(x \not \sim _k z\). Since \(x \ne z\) must hold according to Lemma 1, we can conclude that \(X=\{x,y,z\}\) are pairwise distinct. We consider a kernel matrix \(\mathbf {K}\) obtained by k for X and assume that the first, second and third row as well as column is associated with x, y and z, respectively. There must be entries \(k_{12}=k_{21}=k_{23}=k_{32}=1\) and \(k_{13}=k_{31}=0\). According to Lemma 1 the entries of the main diagonal \(k_{11}=k_{22}=k_{33}=1\) follow. Considering the coefficient vector \(\mathbf {c}\) with \(c_1=c_3=1\) and \(c_2=-1\), we obtain \(\mathbf {c}^\top \mathbf {K} \mathbf {c} = -1\). Hence, \(\mathbf {K}\) is not p.s.d. and k is not a kernel, contradicting the assumption. \(\square \)
We use these results to construct a feature map for a binary kernel. We restrict our consideration to the set \(\mathcal {X} _{{{\,\mathrm{ref}\,}}} = \{ x \in \mathcal {X} \mid x \sim _k x \}\), on which \(\sim _k\) is an equivalence relation. The quotient set \(\mathcal {Q}_k=\mathcal {X} _{{{\,\mathrm{ref}\,}}}/\!\!\sim _k\) is the set of equivalence classes induced by \(\sim _k\). Let \([x]_k\) denote the equivalence class of \(x \in \mathcal {X} _{{{\,\mathrm{ref}\,}}}\) under the relation \(\sim _k\). Let \(k_\delta \) be the Dirac kernel on the equivalence classes \(\mathcal {Q}_k\), then \(k(x,y) = k_\delta ([x]_k,[y]_k)\) and we obtain the following result.
Proposition 1
Let k be a binary kernel with \(\mathcal {Q}_k = \{Q_1,\dots ,Q_d\}\), then \(\phi :\mathcal {X} \rightarrow \{0,1\}^{d}\) with \(\phi (x)_i = 1\) if \(Q_i=[x]_k\), and 0 otherwise, is a feature map of k.
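Proposition 1 can be sketched on a finite sample: group the self-similar elements into the equivalence classes of \(\sim _k\) by probing k, then emit one-hot vectors over the classes. The first-letter kernel below is a hypothetical stand-in for a binary base kernel.

```python
def binary_kernel_feature_map(xs, k):
    """One-hot feature map of Proposition 1 for a binary kernel k on a
    finite sample xs; assumes k is a valid (p.s.d.) binary kernel, so
    that ~_k partitions the self-similar elements into classes."""
    classes = []   # one representative per equivalence class
    index = {}     # element -> class index (None if k(x, x) == 0)
    for x in xs:
        if k(x, x) == 0:       # x outside X_ref: mapped to the zero vector
            index[x] = None
            continue
        for j, rep in enumerate(classes):
            if k(x, rep) == 1:
                index[x] = j
                break
        else:
            index[x] = len(classes)
            classes.append(x)
    d = len(classes)

    def phi(x):
        v = [0] * d
        if index[x] is not None:
            v[index[x]] = 1
        return v

    return phi

# Hypothetical binary kernel: strings are equivalent iff they share
# their first letter.
k = lambda x, y: 1 if x[0] == y[0] else 0
phi = binary_kernel_feature_map(["apple", "ant", "bee"], k)
assert phi("apple") == phi("ant") == [1, 0]
```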
4.2 Closure properties
For a kernel k on a nonempty set \(\mathcal {X} \) the function \(k^{\alpha }(x,y) = \alpha k(x,y)\) with \(\alpha \) in \(\mathbb {R} _{\ge 0}\) is again a kernel on \(\mathcal {X} \). Let \(\phi \) be a feature map of k, then \(\phi ^{\alpha }(x) = \sqrt{\alpha }\phi (x)\) is a feature map of \(k^{\alpha }\). For addition and multiplication, we get the following result.
Proposition 2
(ShaweTaylor and Cristianini 2004, pp. 75 sqq.)
Remark 1
In case of \(k_1 = k_2 = \dots = k_D\), we have \(k^{+}(x,y)=D k_1(x,y)\) and a \(d_1\)-dimensional feature map can be obtained. For \(k^{\bullet }\) we have \(k_1(x,y)^D\), which in general does not allow for a feature space of dimension smaller than \(d_1^D\).
We state an immediate consequence of Proposition 2 regarding the sparsity of the obtained feature vectors explicitly.
Corollary 1
4.3 Kernels on sets
Corollary 2
A crucial observation is that the number of nonzero components of a feature vector depends both on the cardinality and structure of the set X and on the feature map \(\phi \) acting on the elements of X. It is as large as possible when each element of X is mapped by \(\phi \) to a feature vector with distinct nonzero components.
4.4 Convolution kernels
Haussler (1999) proposed R-convolution kernels as a generic framework to define kernels between composite objects. In the following we derive feature maps for such kernels using the basic closure properties introduced in the previous sections. Thereby, we generalize the result presented by Kriege et al. (2014).
Definition 1
Kriege et al. (2014) considered the special case that \(\kappa \) is a binary kernel, cf. Sect. 4.1. From Proposition 1 and Eq. (5) we directly obtain their result as a special case.
Corollary 3
(Kriege et al. 2014, Theorem 3) Let \(k^\star \) be an R-convolution kernel with binary kernel \(\kappa \) and \(\mathcal {Q}_\kappa = \{Q_1,\dots ,Q_d\}\), then \(\phi ^\star :\mathcal {X} \rightarrow \mathbb {N} ^d\) with \(\phi ^\star (x)_i = |Q_i \cap X|\) is a feature map of \(k^\star \).
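Corollary 3 amounts to a histogram over equivalence classes: decompose the object into its parts X and count the parts falling into each class \(Q_i\). A minimal sketch, where the `class_of` helper and the label data are hypothetical:

```python
from collections import Counter

def convolution_feature_map(parts, class_of):
    """Feature map of Corollary 3: phi(x)_i = |Q_i ∩ X|, computed by
    counting parts per equivalence class. `class_of` maps a part to the
    identifier of its class under the binary base kernel."""
    return Counter(class_of(p) for p in parts)

# Toy example: the parts of a graph could be its vertices and the
# classes their labels, so the feature vector is the label histogram.
parts = ["C", "C", "O", "H", "H", "H"]
phi = convolution_feature_map(parts, lambda p: p)
# phi maps each class to |Q_i ∩ X|: H -> 3, C -> 2, O -> 1
```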
5 Computing graph kernels by explicit and implicit feature maps
Building on the systematic construction of feature maps of kernels, we discuss explicit and implicit computation schemes of graph kernels. We first introduce weighted vertex kernels. This family of kernels generalizes the GraphHopper kernel (Feragen et al. 2013) and graph invariant kernels (Orsini et al. 2015) for attributed graphs, which were recently proposed with an implicit method of computation. We derive (approximative) explicit feature maps for weighted vertex kernels. Then, we develop explicit and implicit methods of computation for fixed length walk kernels, which both exploit sparsity for efficient computation. Finally, we discuss shortest-path and subgraph kernels, for which both computation schemes have been considered previously, and put them in the context of our systematic study. We empirically study both computation schemes for graph kernels in Sect. 6, confirming our theoretical results.
5.1 Weighted vertex kernels
5.1.1 Weight kernels
We discuss two kernels for attributed graphs, which have been proposed recently and can be seen as instances of weighted vertex kernels.
5.1.2 Vertex kernels
5.1.3 Computing explicit feature maps
In the following we derive an explicit mapping for weighted vertex kernels. Notice that Eq. (6) is an instance of Definition 1. Hence, by Proposition 2 and Eq. (5), we obtain an explicit mapping \(\phi ^\text {WV}\) of weighted vertex kernels.
Proposition 3
Widely used kernels for the comparison of attributes, such as the Gaussian RBF kernel, do not have feature maps of finite dimension. However, Rahimi and Recht (2008) obtained finitedimensional feature maps approximating the kernels \(k_{\text {RBF}}\) and \(k_{\varDelta }\) of Eq. (9). Similar results are known for other popular kernels for vectorial data like the Jaccard (Vedaldi and Zisserman 2012) and the Laplacian kernel (Andoni 2009).
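As an illustration of such approximative feature maps, the random Fourier feature construction of Rahimi and Recht (2008) for the Gaussian RBF kernel can be sketched as follows. This is the standard textbook form with \(k_{\text {RBF}}(x,y) = \exp (-\gamma \Vert x-y\Vert ^2)\), not necessarily the exact parameterization used in the paper.

```python
import numpy as np

def rff_map(X, D, gamma, rng):
    """Random Fourier features approximating the Gaussian RBF kernel
    exp(-gamma * ||x - y||^2): sample frequencies from the kernel's
    spectral density N(0, 2*gamma*I), shift by random phases, and take
    cosines. Rows of X are data points; returns a (len(X), D) matrix Z
    with Z[i] @ Z[j] ≈ k_RBF(X[i], X[j])."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))
Z = rff_map(X, D=5000, gamma=0.5, rng=rng)
exact = np.exp(-0.5 * np.sum((X[0] - X[1]) ** 2))
approx = Z[0] @ Z[1]
assert abs(exact - approx) < 0.1
```

The approximation error decreases as \(\mathcal {O}(1/\sqrt{D})\), so the feature dimension D trades accuracy against the cost of the explicit map.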
Proposition 4
Proof
5.2 Fixed length walk kernels
In contrast to the classical walk-based graph kernels, fixed length walk kernels take only walks up to a certain length into account. Such kernels have been successfully used in practice (Borgwardt et al. 2005; Harchaoui and Bach 2007) and are not susceptible to the phenomenon of halting (Sugiyama and Borgwardt 2015). We propose explicit and implicit computation schemes for fixed length walk kernels supporting arbitrary vertex and edge kernels. Our implicit computation scheme is based on product graphs and benefits from sparse vertex and edge kernels. Previously, no algorithms based on explicit mapping have been proposed for the computation of walk-based kernels. For graphs with discrete labels, we identify the label diversity and walk length as key parameters affecting the running time. This is confirmed experimentally in Sect. 6.
5.2.1 Basic definitions
A fixed length walk kernel measures the similarity between graphs based on the similarity between all pairs of walks of length \(\ell \) contained in the two graphs. A walk of length \(\ell \) in a graph G is a sequence of vertices and edges \((v_0,e_1,v_1,\dots ,e_\ell ,v_\ell )\) such that \(e_i = v_{i-1}v_i \in E(G)\) for \(i \in \{1, \dots , \ell \}\). We denote the set of walks of length \(\ell \) in a graph G by \(\mathcal {W}_\ell (G)\).
Definition 2
A variant of the \(\ell \)-walk kernel can be obtained by considering all walks up to length \(\ell \).
Definition 3
This kernel is referred to as the k-step random walk kernel by Sugiyama and Borgwardt (2015). In the following we primarily focus on the \(\ell \)-walk kernel, although our algorithms and results can easily be transferred to the Max-\(\ell \)-walk kernel.
5.2.2 Walk and convolution kernels
We show that the \(\ell \)-walk kernel is p.s.d. if \(k_W\) is a valid kernel by seeing it as an instance of an R-convolution kernel. We use this fact to develop an algorithm for explicit mapping based on the ideas presented in Sect. 4.4.
Proposition 5
The \(\ell \)-walk kernel is positive semidefinite if \(k_W\) is defined according to Eq. (14) and \(k_V\) and \(k_E\) are valid kernels.
Proof
Since kernels are closed under taking linear combinations with nonnegative coefficients, see Sect. 4, we obtain the following corollary.
Corollary 4
The Max-\(\ell \)-walk kernel is positive semidefinite.
5.2.3 Implicit kernel computation
An essential part of the implicit computation scheme is the generation of the product graph, which is then used to compute the \(\ell \)-walk kernel.
Computing direct product graphs In order to support graphs with arbitrary attributes, vertex and edge kernels \(k_V\) and \(k_E\) are considered as part of the input. Product graphs can be used to represent these kernel values between pairs of vertices and edges of the input graphs in a compact manner. We avoid creating vertices and edges that would represent incompatible pairs with kernel value zero. The following definition can be considered a weighted version of the direct product graph introduced by Gärtner et al. (2003) for kernel computation.^{3}
Definition 4
Since the weighted direct product graph is undirected, we must avoid processing the same pair of edges twice. Therefore, we assume an arbitrary total order \(\prec \) on the vertices \(\mathcal {V}\), such that for every pair \((u,s),(v,t)\in \mathcal {V}\) either \((u,s) \prec (v,t)\) or \((v,t) \prec (u,s)\) holds. In line 8 we restrict the edge pairs that are compared to one of these cases.
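The construction of Definition 4 together with the total order \(\prec \) can be sketched as follows. The data layout, the function names and the use of tuple ordering as \(\prec \) are illustrative assumptions, not the paper's Algorithm 1.

```python
def weighted_product_graph(G, H, k_V, k_E):
    """Sketch of a weighted direct product graph. Graphs are given as
    (vertex list, edge list) with edges as 2-tuples; product vertices and
    edges carry the corresponding kernel values as weights."""
    (VG, EG), (VH, EH) = G, H
    # Vertex pairs with nonzero vertex kernel value become product vertices.
    V = {}
    for u in VG:
        for s in VH:
            w = k_V(u, s)
            if w != 0:
                V[(u, s)] = w
    # Edge pairs are compared only if both endpoint pairs are compatible;
    # ordering the product vertices (min/max on tuples) plays the role of
    # the total order that prevents processing an edge pair twice.
    E = {}
    for (u, v) in EG:
        for (s, t) in EH:
            for a, b in (((u, s), (v, t)), ((u, t), (v, s))):
                a, b = min(a, b), max(a, b)
                if a != b and a in V and b in V:
                    w = k_E((u, v), (s, t))
                    if w != 0:
                        E[(a, b)] = w
    return V, E

# Two single-edge graphs under trivial (constant one) kernels: the product
# has 4 vertices and 2 edges.
P = ([0, 1], [(0, 1)])
V, E = weighted_product_graph(P, P, lambda u, s: 1, lambda e, f: 1)
```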
Proposition 6
Let \(n=|V(G)|\), \(n'=|V(H)|\), \(m=|E(G)|\) and \(m'=|E(H)|\). Algorithm 1 computes the weighted direct product graph in time \(\mathcal {O}(n n' {\mathsf {T}}_{V} + m m' {\mathsf {T}}_{E})\), where \({\mathsf {T}}_{V}\) and \({\mathsf {T}}_{E}\) are the running times to compute vertex and edge kernels, respectively.
Note that in case of a sparse vertex kernel, which yields zero for most vertex pairs of the input graphs, \(|V(G \times _{w}H)| \ll |V(G)| \cdot |V(H)|\) holds. Algorithm 1 compares two edges by \(k_E\) only in case of matching endpoints (cf. lines 7, 8); therefore, in practice the running time to compare edges (lines 7–13) might be considerably less than suggested by Proposition 6. We show this empirically in Sect. 6.4. In case of sparse graphs, i.e., \(|E|=\mathcal {O}(|V|)\), and vertex and edge kernels that can be computed in time \(\mathcal {O}(1)\), the running time of Algorithm 1 is \(\mathcal {O}(n^2)\), where \(n=\max \{|V(G)|,|V(H)|\}\).
Counting weighted walks Given an undirected graph G with adjacency matrix \(\mathbf {A}\), let \(a^\ell _{ij}\) denote the element at (i, j) of the matrix \(\mathbf {A}^\ell \). It is well-known that \(a^\ell _{ij}\) is the number of walks of length \(\ell \) from vertex i to j. The number of \(\ell \)-walks of G consequently is \(\sum _{i,j} a^\ell _{ij} = {\mathbf {1}}^\top \mathbf {A}^\ell {\mathbf {1}} = {\mathbf {1}}^\top \mathbf {r}_\ell \), where \(\mathbf {r}_\ell = \mathbf {A} \mathbf {r}_{\ell -1}\) with \(\mathbf {r}_0 = {\mathbf {1}}\). The ith element of the recursively defined vector \(\mathbf {r}_\ell \) is the number of walks of length \(\ell \) starting at vertex i. Hence, we can compute the number of \(\ell \)-walks by computing either matrix powers or matrix-vector products. Note that even for sparse (connected) graphs \(\mathbf {A}^\ell \) quickly becomes dense with increasing walk length \(\ell \). The \(\ell \)th power of an \(n \times n\) matrix \(\mathbf {A}\) can be computed naïvely in time \(\mathcal {O}(n^\omega \ell )\) and in time \(\mathcal {O}(n^\omega \log \ell )\) using exponentiation by squaring, where \(\omega \) is the exponent of matrix multiplication. The vector \(\mathbf {r}_\ell \) can be computed by means of matrix-vector multiplications, where the matrix \(\mathbf {A}\) remains unchanged over all iterations. Since direct product graphs tend to be sparse in practice, we propose a method to compute the \(\ell \)-walk kernel that is inspired by matrix-vector multiplication.
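The recursion \(\mathbf {r}_\ell = \mathbf {A} \mathbf {r}_{\ell -1}\) over adjacency lists can be sketched as follows; this is a minimal unweighted illustration assuming an adjacency-list input, not the paper's Algorithm 2, which additionally handles the product-graph weights.

```python
def count_walks(adj, length):
    """Total number of walks of the given length, via r_l = A r_{l-1}
    with r_0 = 1; adjacency lists make each step cost O(n + m)."""
    r = [1] * len(adj)  # r_0: one walk of length 0 per vertex
    for _ in range(length):
        # (A r)[i] sums r over the neighbours of vertex i
        r = [sum(r[j] for j in adj[i]) for i in range(len(adj))]
    return sum(r)       # 1^T r_l

# Triangle graph: 3 start vertices, 2 choices per step -> 12 walks of length 2.
triangle = [[1, 2], [0, 2], [0, 1]]
print(count_walks(triangle, 2))  # 12
```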
Theorem 1
Let \(n=|\mathcal {V}|\) and \(m=|\mathcal {E}|\). Algorithm 2 computes the \(\ell \)-walk kernel in time \(\mathcal {O}(n+\ell (n+m) + {\mathsf {T}}_{\mathrm{WDPG}})\), where \({\mathsf {T}}_{\mathrm{WDPG}}\) is the time to compute the weighted direct product graph.
Note that the running time depends on the size of the product graph, and \(n \ll |V(G)| \cdot |V(H)|\) and \(m \ll |E(G)| \cdot |E(H)|\) are possible as discussed in Sect. 5.2.3.
5.2.4 Explicit kernel computation
We have shown in Sect. 5.2.2 that \(\ell \)-walk kernels are R-convolution kernels. Therefore, we can derive explicit feature maps with the techniques introduced in Sect. 4. Provided that we know explicit feature maps for the vertex and edge kernels, we can derive explicit feature maps for the kernel on walks and obtain an explicit computation scheme by enumerating all walks. We propose a more elaborate approach that avoids enumeration and exploits the simple composition of walks.
The dimension of the feature space and the density of the feature vectors depend multiplicatively on the same properties of the feature vectors of \(k_V\) and \(k_E\). For nontrivial vertex and edge kernels explicit computation of the \(\ell \)-walk kernel is likely to be infeasible in practice. Therefore, we now consider graphs with simple labels from the alphabet \(\mathcal {L}\) and the kernel \(k^\delta _W\) given by Eq. (15). Following Gärtner et al. (2003) we can construct a feature map in this case, where the features are sequences of labels associated with walks. As we will see later, this feature space is indeed obtained with Algorithm 3. A walk w of length \(\ell \) is associated with a label sequence \(\tau (w)=(\tau (v_0),\tau (e_1),\dots ,\tau (v_\ell )) \in \mathcal {L}^{2\ell +1}\). Moreover, graphs are decomposed into walks and two walks w and \(w'\) are considered equivalent if and only if \(\tau (w) = \tau (w')\). This gives rise to the feature map \(\phi ^=_\ell \), where each component is associated with a label sequence \(s \in \mathcal {L}^{2\ell +1}\) and counts the number of walks \(w \in \mathcal {W}_\ell (G)\) with \(\tau (w) = s\). Note that the obtained feature vectors have \(|\mathcal {L}|^{2\ell +1}\) components, but are typically sparse. In fact, Algorithm 3 constructs this feature map. We assume that \(\phi ^V(v)\) and \(\phi ^E(uv)\) have exactly one nonzero component associated with the label \(\tau (v)\) and \(\tau (uv)\), respectively. Then the single nonzero component of \(\phi ^V(u) \otimes \phi ^E(uv) \otimes \phi ^V(v)\) is associated with the label sequence \(\tau (w)\) of the walk \(w=(u, uv, v)\). A walk of length \(\ell \) can be decomposed into a walk of length \(\ell -1\) with an additional edge and vertex added at the front.
This allows us to obtain the number of walks of length \(\ell \) with a given label sequence starting at a fixed vertex v by concatenating \((\tau (v),\tau (vu))\) with all label sequences of walks of length \(\ell -1\) starting from a neighbor u of the vertex v. This construction is applied in every iteration of the outer for-loop in Algorithm 3 and the feature vectors \(\varPhi _i\) are easy to interpret. Each component of \(\varPhi _i(v)\) is associated with a label sequence \(s \in \mathcal {L}^{2i+1}\) and counts the walks w of length i starting at v with \(\tau (w)=s\).
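The dynamic program sketched above can be illustrated as follows. The dictionary-of-counters layout and all names are assumptions made for illustration; this only mimics Algorithm 3 for the Dirac-kernel case.

```python
from collections import Counter

def walk_features(adj, vlabel, elabel, length):
    """Phi_i(v) counts walks of length i starting at v by their label
    sequence; the graph feature vector sums Phi_length over all vertices.
    `adj`, `vlabel`, `elabel` are illustrative data structures."""
    # Phi_0(v): the single sequence (tau(v)) with count 1.
    phi = {v: Counter({(vlabel[v],): 1}) for v in adj}
    for _ in range(length):
        new_phi = {}
        for v in adj:
            c = Counter()
            for u in adj[v]:
                # prepend (tau(v), tau(vu)) to every sequence starting at u
                for seq, cnt in phi[u].items():
                    c[(vlabel[v], elabel[frozenset((v, u))]) + seq] += cnt
            new_phi[v] = c
        phi = new_phi
    total = Counter()
    for v in adj:
        total += phi[v]
    return total

# Path A-B-A with edge label 'e': the four walks of length 1 yield the
# label sequences (A,e,B) twice and (B,e,A) twice.
adj = {0: [1], 1: [0, 2], 2: [1]}
vlabel = {0: 'A', 1: 'B', 2: 'A'}
elabel = {frozenset((0, 1)): 'e', frozenset((1, 2)): 'e'}
features = walk_features(adj, vlabel, elabel, 1)
```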
We consider the running time for the case where graphs have discrete labels and the kernel \(k^\delta _W\) is given by Eq. (15).
Theorem 2
Given a graph G with \(n=|V(G)|\) vertices and \(m=|E(G)|\) edges, Algorithm 3 computes the \(\ell \)-walk kernel feature vector \(\phi ^=_\ell (G)\) in time \(\mathcal {O}(n+\ell (n+m)s)\), where s is the maximum number of different label sequences of \((\ell -1)\)-walks starting at a vertex of G.
Assume Algorithm 3 is applied to unlabeled sparse graphs, i.e., \(|E(G)| = \mathcal {O}(|V(G)|)\); then \(s = 1\) and the feature mapping can be performed in time \(\mathcal {O}(n+\ell n)\). This yields a total running time of \(\mathcal {O}(d \ell n + d^2)\) to compute a kernel matrix for d graphs of order n for \(\ell >0\).
5.3 Shortest-path kernel
A classical kernel applicable to attributed graphs is the shortest-path kernel (Borgwardt and Kriegel 2005). This kernel compares all shortest paths in two graphs according to their lengths and the vertex annotations of their endpoints. The kernel was proposed with an implicit computation scheme, but explicit computation schemes have been used for graphs with discrete labels.
Its computation is performed in two steps (Borgwardt and Kriegel 2005): for each graph G of the data set the complete graph \(G'\) on the vertex set V(G) is generated, where an edge uv is annotated with the length of a shortest path from u to v. The shortest-path kernel is then equivalent to the walk kernel with fixed length \(\ell =1\) between these transformed graphs, where the kernel essentially compares all pairs of edges. The kernel \(k_E\) used to compare path lengths may, for example, be realized by the Brownian bridge kernel (Borgwardt and Kriegel 2005).
For the application to graphs with discrete labels a more efficient method of computation by explicit mapping has been reported by Shervashidze et al. (2011, Section 3.4.1). When \(k_V\) and \(k_E\) both are Dirac kernels, each component of the feature vector corresponds to a triple consisting of two vertex labels and a path length. This method of computation has been applied in several experimental comparisons, e.g., Kriege and Mutzel (2012) and Morris et al. (2016). This feature map is directly obtained from our results in Sect. 4. It is also recovered by our explicit computation schemes for fixed length walk kernels reported in Sect. 5.2. However, we can also derive explicit feature maps for nontrivial kernels \(k_V\) and \(k_E\). Then the dimension of the feature map increases due to the product of kernels, cf. Eq. (21). We will study this and the effect on running time experimentally in Sect. 6.
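For the Dirac case, the triple-based feature map can be sketched as follows, assuming unit edge lengths and Floyd-Warshall for the shortest-path transformation; the names and data layout are illustrative, not the cited implementation.

```python
from collections import Counter
from itertools import combinations

def sp_features(adj, vlabel):
    """One count per triple (label(u), d(u, v), label(v)) over all vertex
    pairs, where d is the shortest-path distance (illustrative sketch)."""
    n = len(adj)
    INF = float('inf')
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in adj[i]:
            d[i][j] = 1
    for k in range(n):  # Floyd-Warshall
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    feats = Counter()
    for u, v in combinations(range(n), 2):
        if d[u][v] < INF:
            # sort the endpoint labels so each unordered pair yields
            # one canonical triple
            a, b = sorted((vlabel[u], vlabel[v]))
            feats[(a, d[u][v], b)] += 1
    return feats

# Path A-B-A: two (A,1,B) pairs and one (A,2,A) pair.
path = [[1], [0, 2], [1]]
feats = sp_features(path, {0: 'A', 1: 'B', 2: 'A'})
```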
5.4 Graphlet, subgraph and subgraph matching kernels
Subgraph or graphlet kernels have been proposed for unlabeled graphs or graphs with discrete labels (Gärtner et al. 2003; Wale et al. 2008; Shervashidze et al. 2009). The subgraph matching kernel has been developed as an extension for attributed graphs (Kriege and Mutzel 2012).
Table 2: Graph kernels and their properties

| Graph kernel | Parts | Dimension | Running time (implicit) |
| --- | --- | --- | --- |
| GraphHopper | \(\mathcal {O}(n)\) | \(\delta ^2 d_V\) | \(\mathcal {O}\left( n^2(m + \log n + {\mathsf {T}}_{V} + \delta ^2 )\right) \) |
| GraphInvariant | \(\mathcal {O}(n)\) | \(C d_V\) | \(\mathcal {O}\left( hm + n^2 {\mathsf {T}}_{V}\right) \) |
| FixedLengthWalk | \(\mathcal {O}(\varDelta ^\ell )\) | \(d_V+(d_V d_E)^\ell \) | \(\mathcal {O}\left( \ell (n^2+m^2) + n^2{\mathsf {T}}_{V}+m^2{\mathsf {T}}_{E}\right) \) |
| ShortestPath | \(\mathcal {O}(n^2)\) | \(d_V^2 d_E\) | \(\mathcal {O}\left( n^2{\mathsf {T}}_{V}+n^4{\mathsf {T}}_{E}\right) \) |
| SubgraphMatching | \(\mathcal {O}(n^s)\) | \((d_V d_E^2)^s s!\) | \(\mathcal {O}\left( sn^{2s+2} + n^2{\mathsf {T}}_{V}+n^4{\mathsf {T}}_{E} \right) \) |
Indeed, the observations above are key to several graph kernels. The graphlet kernel (Shervashidze et al. 2009), see also Sect. 2, is an instance of the subgraph kernel and computed by explicit feature maps. However, only unlabeled graphs of small size are considered by the graphlet kernel, such that the canonizing function can be computed easily. The same approach was taken by Wale et al. (2008), considering larger connected subgraphs of labeled graphs derived from chemical compounds. In contrast, for attributed graphs with continuous vertex labels, the function \(k_\simeq \) is not sufficient to compare subgraphs adequately. Therefore, subgraph matching kernels were proposed by Kriege and Mutzel (2012), which allow specifying arbitrary kernel functions to compare vertex and edge attributes. Essentially, this kernel considers all mappings between subgraphs and scores each mapping by the product of vertex and edge kernel values of the vertex and edge pairs involved in the mapping. When the specified vertex and edge kernels are Dirac kernels, the subgraph matching kernel is equal to the subgraph kernel up to a factor taking the number of automorphisms between subgraphs into account (Kriege and Mutzel 2012). Based on the above observations, explicit mapping of subgraph matching kernels is likely to be more efficient when subgraphs can be adequately compared by a binary kernel.
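The scoring of mappings described above can be sketched for two small subgraphs as follows; the brute-force enumeration of bijections and all names are illustrative simplifications of the subgraph matching kernel, not the paper's algorithm.

```python
from itertools import permutations

def matching_score(VG, VH, edges_G, edges_H, k_V, k_E):
    """Score every bijection between two equal-size vertex sets by the
    product of vertex and edge kernel values of the mapped pairs.
    `edges_G`/`edges_H` map a frozenset of endpoints to its edge label;
    missing keys (non-edges) yield None on both sides (hypothetical layout)."""
    total = 0.0
    for perm in permutations(VH):
        w = 1.0
        for u, s in zip(VG, perm):
            w *= k_V(u, s)                   # vertex pairs of the mapping
        for i in range(len(VG)):
            for j in range(i + 1, len(VG)):  # edge pairs of the mapping
                w *= k_E(edges_G.get(frozenset((VG[i], VG[j]))),
                         edges_H.get(frozenset((perm[i], perm[j]))))
        total += w
    return total

# Two unlabeled triangles with a uniform edge label under a Dirac edge
# kernel: every one of the 3! bijections is a perfect match, so score = 6.
tri = {frozenset(p): 'x' for p in [(0, 1), (0, 2), (1, 2)]}
dirac = lambda a, b: 1.0 if a == b else 0.0
score = matching_score([0, 1, 2], [0, 1, 2], tri, tri, lambda u, s: 1.0, dirac)
```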
5.5 Discussion
A crucial observation of our study of feature maps for composed kernels in Sect. 4 is that the number of components of the feature vectors increases multiplicatively under taking products of kernels; this also holds in terms of nonzero components. Unless feature vectors have few nonzero components, this operation is likely to be prohibitive in practice. However, if feature vectors have exactly one nonzero component, like those associated with binary kernels, taking products of kernels is manageable by sparse data structures.
In this section we have systematically constructed and discussed feature maps of several graph kernels, and the observation mentioned above is expected to affect the kernels to varying extents. While weighted vertex kernels do not take products of vertex and edge kernels, the shortest-path kernel and, in particular, the subgraph matching and fixed length walk kernels heavily rely on multiplicative combinations. Considering the relevant special case of a Dirac kernel, which leads to feature vectors with only one nonzero component, the rapid growth due to multiplication is tamed. In this case the number of substructures considered different according to the vertex and edge kernels determines the number of nonzero components of the feature vectors associated with the graph kernel. The basic characteristics of the considered graph kernels are summarized in Table 2. The sparsity of the feature vectors of the vertex and edge kernels is an important intervening factor, which is difficult to assess theoretically; we therefore proceed with an experimental study.
6 Experimental evaluation
Our experimental study addresses the following questions:

Q1 Are approximative explicit feature maps of kernels for attributed graphs competitive in terms of running time and classification accuracy compared to exact implicit computation?

Q2 Are exact explicit feature maps competitive for kernels relying on multiplication when the Dirac kernel is used to compare discrete labels? How do graph properties such as label diversity affect the running time?

  (a) How does the fixed length walk kernel behave with regard to these questions and what influence does the walk length have?

  (b) Can the same behavior regarding running time be observed for the graphlet and subgraph matching kernel?
6.1 Experimental setup
All algorithms were implemented in Java and the default Java HashMap class was used to store feature vectors. Due to the varied memory requirements of the individual series of experiments, different hardware platforms were used in Sects. 6.2, 6.4 and 6.5. In order to compare the running time of both computational strategies systematically without the dependence on one specific kernel method, we report the running time to compute the quadratic kernel matrices, unless stated otherwise. We performed classification experiments using the C-SVM implementation LIBSVM (Chang and Lin 2011). We report mean prediction accuracies obtained by 10-fold cross-validation repeated 10 times with random fold assignments. Within each fold all necessary parameters were selected by cross-validation based on the training set. This includes the regularization parameter C selected from \(\{10^{-3}, 10^{-2}, \dots , 10^3\}\), all kernel parameters, where applicable, and whether to normalize the kernel matrix.
6.1.1 Data sets
Data set statistics and properties

| Data set | Graphs | Classes | Avg. \(|V|\) | Avg. \(|E|\) | Vertex/edge labels | Attributes |
| --- | --- | --- | --- | --- | --- | --- |
| Mutag | 188 | 2 | 17.9 | 19.8 | \(+\)/\(+\) | − |
| U251 | 3 755 | 2 | 23.1 | 24.8 | \(+\)/\(+\) | − |
| Enzymes | 600 | 6 | 32.6 | 62.1 | \(+\)/\(+\) | 18 |
| Proteins | 1 113 | 2 | 39.1 | 72.8 | \(+\)/− | 1 |
| SyntheticNew | 300 | 2 | 100.0 | 196.3 | −/− | 1 |
| Synthie | 400 | 4 | 95.0 | 172.9 | −/− | 15 |
Small molecules Molecules can naturally be represented by graphs, where vertices represent atoms and edges represent chemical bonds. Mutag is a data set of chemical compounds divided into two classes according to their mutagenic effect on a bacterium. This small data set is commonly used in the graph kernel literature. In addition we considered the larger data set U251, which stems from the NCI Open Database provided by the National Cancer Institute (NCI). In this data set the class labels indicate the ability of a compound to inhibit the growth of the tumor cell line U251. We used the data set processed by Swamidass et al. (2005), which is publicly available from the ChemDB website.^{5}
Macromolecules Enzymes and Proteins both represent macromolecular structures and were obtained from Borgwardt et al. (2005) and Feragen et al. (2013). The following graph model has been employed: vertices represent secondary structure elements (SSEs) and are annotated by their type, i.e., helix, sheet or turn, and a rich set of physical and chemical attributes. Two vertices are connected by an edge if they are neighbors along the amino acid sequence or one of three nearest neighbors in space. Edges are annotated with their type, i.e., structural or sequential. In Enzymes each graph is annotated by an EC top level class, which reflects the chemical reaction the enzyme catalyzes; Proteins is divided into enzymes and non-enzymes.
Synthetic graphs The data sets SyntheticNew and Synthie were synthetically generated to obtain classification benchmarks for graph kernels with attributes. We refer the reader to the publications Feragen et al. (2013)^{6} and Morris et al. (2016), respectively, for the details of the generation process. Additionally, we generated new synthetic graphs in order to systematically vary graph properties of interest like the label diversity, which we expect to have an effect on the running time according to our theoretical analysis.
6.2 Approximative explicit feature maps of kernels for attributed graphs (Q1)
We have derived explicit computation schemes for kernels for attributed graphs which have been proposed with an implicit method of computation. Approximative explicit computation is possible under the assumption that the kernel for the attributes can be approximated by explicit feature maps. We compare both methods of computation w.r.t. their running time and the obtained classification accuracy on the four attributed graph data sets Enzymes, Proteins, SyntheticNew and Synthie. Since the discrete labels alone are often highly informative, we ignored discrete labels if present and considered only the real-valued vertex annotations in order to obtain challenging classification problems. All attributes were linearly scaled dimension-wise to the range [0, 1] in a preprocessing step.
6.2.1 Method
We employed three kernels for attributed graphs: the shortest-path kernel, cf. Sect. 5.3, and the GraphHopper and GraphInvariant kernels as described in Sect. 5.1.1. Preliminary experiments with the subgraph matching kernel showed that it cannot be computed by explicit feature maps for nontrivial subgraph sizes due to its high memory consumption. The same holds for fixed length walk kernels with walk length \(\ell \ge 3\). Therefore, we did not consider these kernels any further regarding Q1, but investigate them for graphs with discrete labels in Sects. 6.4 and 6.5 to answer Q2a and Q2b.
For the shortest-path kernel we used the Dirac kernel to compare path lengths and selected the number of Weisfeiler-Lehman refinement steps for the GraphInvariant kernel from \(h \in \{0, \dots , 7\}\). For the comparison of attributes we employed the dimension-wise product of the hat kernel \(k_\varDelta \) as defined in Eq. (9), choosing \(\delta \) from \(\{0.2, 0.4, 0.6, 0.8, 1.0, 1.5, 2.0\}\). The three kernels computed functionally employing this kernel serve as a baseline. We obtained approximate explicit feature maps for the attribute kernel by the method of Rahimi and Recht (2008) and used these to derive approximate explicit feature maps for the graph kernels. We varied the number of random binning features in \(\{1, 2, 4, 8, 16, 32, 64\}\), which corresponds to the number of nonzero components of the feature vectors for the attribute kernel and controls its approximation quality. Please note that the running time is affected by the kernel parameters, i.e., \(\delta \) of Eq. (9) and the number h of Weisfeiler-Lehman refinement steps for GraphInvariant. Therefore, in the following we report the running times for the fixed values \(\delta =1\) and \(h=3\), which were selected frequently by cross-validation.
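The random binning idea for the hat kernel can be sketched as follows: a grid of pitch \(\delta \) with a uniformly random shift puts two points into the same bin with probability \(\max (0, 1-|x-y|/\delta )\), so averaging bin agreements over several grids approximates the kernel. The code below is an illustrative one-dimensional sketch under this assumption, not the implementation used in the experiments.

```python
import random

def binning_features(x, shifts, delta):
    """Joint bin index of point x for each randomly shifted grid of pitch
    delta; agreement of bin indices per dimension happens with hat-kernel
    probability, so joint agreement approximates the dimension-wise product."""
    return [tuple(int((x[d] - s[d]) // delta) for d in range(len(x)))
            for s in shifts]

def approx_hat_kernel(x, y, shifts, delta):
    fx = binning_features(x, shifts, delta)
    fy = binning_features(y, shifts, delta)
    return sum(a == b for a, b in zip(fx, fy)) / len(shifts)

random.seed(0)
delta = 1.0
shifts = [[random.uniform(0, delta)] for _ in range(5000)]
estimate = approx_hat_kernel([0.3], [0.7], shifts, delta)
# exact hat kernel value: max(0, 1 - |0.3 - 0.7| / 1.0) = 0.6
```

More shifts (here 5000) mean more nonzero feature components and a better approximation, which mirrors the trade-off varied in the experiments.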
6.2.2 Results and discussion
We were not able to compute the shortest-path kernel by explicit feature maps with more than 16 iterations of binning for the base kernel on Enzymes and Proteins, or with more than 4 iterations on Synthie and SyntheticNew, given 64 GB of main memory. The high memory consumption of this kernel is in accordance with our theoretical analysis, since the multiplication of vertex and edge kernels drastically increases the number of nonzero components of the feature vectors. This problem does not affect the two weighted vertex kernels to the same extent. We observed the general trend that the memory consumption and running time increase for small values of \(\delta \). This is explained by the fact that the number of components of the feature vectors of the vertex kernels increases in this case. Although the number of nonzero components does not increase for these feature vectors, it does for the graph kernel feature vectors, since the number of vertices with attributes falling into different bins increases.
The results on running time and accuracy are summarized in Fig. 2. For the two data sets Enzymes and Synthie we observe that the classification accuracy obtained by the approximate explicit feature maps approaches the accuracy obtained by the exact method with increasing number of binning iterations. For the other two data sets the accuracy does not improve with the number of iterations. For Proteins the kernels obtained with a single iteration of binning, i.e., essentially applying a Dirac kernel, achieve an accuracy at the same level as the exact kernel obtained by implicit computation. This suggests that for this data set a trivial comparison of attributes is sufficient or that the attributes are not essential for classification at all. For SyntheticNew the kernels using a single iteration of binning are even better than the exact kernel, but get worse as the number of iterations increases. One possible explanation for this is that the vertex kernel used is not a good choice for this data set.
With few iterations of binning the explicit computation scheme is always faster than the implicit computation. The growth in running time with an increasing number of binning iterations for the vertex kernel varies between the graph kernels. Approximating the GraphHopper kernel by explicit feature maps with 64 binning iterations for the vertex kernel leads to a running time similar to the one required for its exact implicit computation on all data sets with the exception of SyntheticNew. On this data set explicit computation remains faster. For GraphInvariant explicit feature maps lead to a running time which is orders of magnitude lower than implicit computation. Although both GraphHopper and GraphInvariant are weighted vertex kernels, this difference can be explained by the number of nonzero components in the feature vectors of the weight kernel. We observe that GraphInvariant clearly provides the best classification accuracy for two of the four data sets and is competitive for the other two. At the same time GraphInvariant can be approximated very efficiently by explicit feature maps. Therefore, even for attributed graphs effective and efficient graph kernels can be obtained from explicit feature maps by our approach.
6.3 Kernels for graphs with discrete labels (Q2)
In order to get a first impression of the runtime behavior of explicit and implicit computation schemes on graphs with discrete labels, we computed the kernel matrix for the standard data sets ignoring the attributes, if present. The experiments were conducted using Java OpenJDK v1.7.0 on an Intel Core i73770 CPU at 3.4 GHz (Turbo Boost disabled) with 16 GB of RAM using a single processor only. The reported running times are average values over 5 runs.
The results are summarized in Table 4. For the shortest-path kernel explicit mapping clearly outperforms implicit computation by several orders of magnitude with respect to running time. This is in accordance with our theoretical analysis, and our results suggest always using explicit computation schemes for this kernel whenever a Dirac kernel is adequate for label and path length comparison. In this case memory consumption is unproblematic, in contrast to the setting discussed in Sect. 6.2.
Table 4: Running times of the fixed length walk kernel (\(\hbox {FLRW}_\ell \)) with walk lengths \(\ell \), the shortest-path kernel (SP), the connected subgraph matching kernel (CSM) and the graphlet kernel (GL) on graphs with discrete labels, in seconds unless stated otherwise

| Kernel | Mutag | U251 | Enzymes | Proteins | SyntheticNew | Synthie |
| --- | --- | --- | --- | --- | --- | --- |
| *Implicit* |  |  |  |  |  |  |
| \(\hbox {FLRW}_0\) | 0.618 | 250.3 | 20.67 | 100.8 | 159.5 | 224.2 |
| \(\hbox {FLRW}_1\) | 0.606 | 281.6 | 23.45 | 116.4 | 202.3 | 284.1 |
| \(\hbox {FLRW}_2\) | 0.652 | 303.4 | 26.03 | 132.0 | 236.2 | 330.7 |
| \(\hbox {FLRW}_3\) | 0.617 | 323.8 | 28.18 | 143.0 | 270.4 | 377.1 |
| \(\hbox {FLRW}_4\) | 0.653 | 343.8 | 30.63 | 156.2 | 304.3 | 424.2 |
| \(\hbox {FLRW}_5\) | 0.693 | 363.7 | 32.65 | 169.5 | 336.9 | 468.6 |
| \(\hbox {FLRW}_6\) | 0.733 | 383.5 | 34.86 | 182.5 | 371.2 | 513.6 |
| \(\hbox {FLRW}_7\) | 0.779 | 404.4 | 36.94 | 195.8 | 404.4 | 558.9 |
| \(\hbox {FLRW}_8\) | 0.870 | 425.3 | 38.16 | 208.8 | 438.0 | 603.9 |
| \(\hbox {FLRW}_9\) | 0.877 | 447.7 | 39.97 | 221.3 | 470.1 | 648.1 |
| SP | 5.272 | 2 h 6\(^{\prime }\)55\(^{\prime \prime }\) | 9\(^{\prime }\)8\(^{\prime \prime }\) | 4 h 58\(^{\prime }\)40\(^{\prime \prime }\) | 2 h 14\(^{\prime }\)23\(^{\prime \prime }\) | 2 h 39\(^{\prime }\)30\(^{\prime \prime }\) |
| CSM | 15.45 | 4 h 0\(^{\prime }\)16\(^{\prime \prime }\) | 34\(^{\prime }\)15\(^{\prime \prime }\) | OOM | \(>\,\)24 h | \(>\,\)24 h |
| *Explicit* |  |  |  |  |  |  |
| \(\hbox {FLRW}_0\) | 0.004 | 0.868 | 0.029 | 0.081 | 0.008 | 0.016 |
| \(\hbox {FLRW}_1\) | 0.014 | 2.827 | 0.080 | 0.141 | 0.024 | 0.032 |
| \(\hbox {FLRW}_2\) | 0.019 | 7.844 | 0.170 | 0.251 | 0.040 | 0.051 |
| \(\hbox {FLRW}_3\) | 0.035 | 14.96 | 0.466 | 0.545 | 0.056 | 0.070 |
| \(\hbox {FLRW}_4\) | 0.058 | 31.73 | 1.518 | 1.207 | 0.072 | 0.092 |
| \(\hbox {FLRW}_5\) | 0.147 | 64.57 | 4.629 | 2.991 | 0.089 | 0.110 |
| \(\hbox {FLRW}_6\) | 0.461 | 107.8 | 13.58 | 6.476 | 0.104 | 0.128 |
| \(\hbox {FLRW}_7\) | 1.127 | 170.9 | 37.72 | 12.07 | 0.124 | 0.150 |
| \(\hbox {FLRW}_8\) | 2.491 | 346.0 | 95.03 | 24.48 | 0.141 | 0.172 |
| \(\hbox {FLRW}_9\) | 4.809 | 646.8 | 278.4 | 56.62 | 0.161 | 0.192 |
| SP | 0.120 | 27.82 | 0.907 | 3.121 | 1.332 | 1.459 |
| GL | 0.011 | 3.512 | 0.205 | 0.354 | 0.186 | 0.310 |
6.4 Fixed length walk kernels for graphs with discrete labels (Q2a)
Our comparison in Sect. 6.2 showed that computation by explicit feature maps becomes prohibitive when vertex and edge kernels with feature vectors having multiple nonzero components are multiplied. This is observed even for the shortest-path kernel, which applies a walk kernel of fixed length one. Therefore, we study the implicit and explicit computation schemes of the fixed length walk kernel on graphs with discrete labels, which are compared by the Dirac kernel, cf. Eq. (15). Since both computation schemes produce the same kernel matrices, our main focus in this section is on running times.
We hypothesize that

  (i) implicit computation benefits from sparse vertex and edge kernels, and

  (ii) explicit computation is promising for graphs with a uniform label structure, which exhibit few different features, and then scales to large data sets.
6.4.1 Synthetic data sets
The results are depicted in Fig. 3, where a label diversity of 50 means that \(p_V=0.5\). Figure 3a shows that the running time for implicit computation increases with the data set size and decreases with the label diversity. This observation is in accordance with our hypotheses. When the label diversity increases, there are fewer compatible pairs of vertices and the weighted direct product graph becomes smaller. Consequently, its computation and the counting of weighted walks require less running time. For explicit computation we observe a different trend: while the running time increases with the size of the data set, the approach is extremely efficient for graphs with uniform labels (\(p_V=0\)) and becomes slower when the label diversity increases, cf. Fig. 3b. Combining both results, cf. Fig. 3c, shows that both approaches yield the same running time for a label diversity of \(p_V \approx 0.3\), while for higher values of \(p_V\) implicit computation is preferable and explicit computation otherwise.
6.4.2 Molecular data sets
Figure 4a shows that the running time of the implicit computation scheme heavily depends on the size of the data set. The increase with the walk length is less pronounced. This can be explained by the time \({\mathsf {T}}_{\mathrm{WDPG}}\) required to compute the product graph, which is always needed independent of the walk length. For short walks explicit computation is very efficient, even for larger data sets, cf. Fig. 4b. However, when a certain walk length is reached the running time increases drastically. This can be explained by the growing number of different label sequences. Notably, for walks of length 8 and 9 the running time also increases largely with the data set size. This indicates that the time \({\mathsf {T}}_{\text {dot}}\) has a considerable influence on the running time. In the following section we analyze the running time of the different procedures for the two algorithms in more detail. Figure 4c shows that for walk lengths up to 7 explicit computation beats implicit computation on the molecular data set.
6.4.3 Enzymes and Mutag
Figure 5 shows the running time of both algorithms depending on the walk length and gives the time for product graph computation and explicit mapping, respectively. In addition, the prediction accuracy is presented. For both data sets we observe that up to a walk length of 7 explicit mapping is more efficient. Notably, a peak of the accuracy is reached for walk lengths smaller than 7 in both cases: for the Mutag data set walks of length 3 provide the best results, and walks of length 6 do so for the Enzymes data set, i.e., in both cases explicit mapping should be preferred when computing a walk kernel of fixed length. The running time of the product graph computation is constant and does not depend on the walk length. For explicit mapping the time required to compute the dot product becomes dominant when the walk length is increased. This can be explained by the fact that the generation of the kernel matrix involves a quadratic number of dot product computations. Note that the given times include a quadratic number of product graph computations, while the times for generating the feature vectors include only a linear number of operations.
As a side note, we also compared the accuracy of the fixed length walk kernels to the accuracy reached by the geometric random walk kernel (GRW) according to Gärtner et al. (2003), which considers arbitrary walk lengths. The parameter \(\gamma \) of the geometric random walk kernel was selected by crossvalidation from \(\{10^{5}, 10^{4},\dots ,10^{2}\}\). We observed that the accuracy of the fixed length walk kernel is competitive on the Mutag data set (GRW 87.3), and considerably better on the Enzymes data set (GRW 31.6), cf. Fig. 5. This is remarkable, since the fixed length walk kernel yields best results with walk length 6, for which it is efficiently computed by explicit mapping. However, this is not possible for the geometric random walk kernel. For a more detailed discussion and comparison between fixed length walk kernels and the geometric random walk kernel we refer the reader to Sugiyama and Borgwardt (2015), which appeared after the conference publication (Kriege et al. 2014).
6.4.4 Weisfeiler-Lehman label refinement
If no refinement is applied, the explicit mapping approach beats the product graph based algorithm for all walk lengths considered. However, as soon as a single iteration of label refinement is performed, the product graph based algorithm becomes competitive for walk lengths 0 and 1 and outperforms the explicit mapping approach for higher walk lengths. The running times do not change substantially for more iterations of refinement. This indicates that a single iteration of Weisfeiler-Lehman refinement results in a high label diversity that does not increase considerably for more iterations on the Enzymes data set. When using our walk-based kernel as base kernel of a Weisfeiler-Lehman graph kernel (Shervashidze et al. 2011), our observation suggests starting with explicit computation and switching to the implicit computation scheme after a few iterations of refinement.
6.5 Subgraph kernels for graphs with discrete labels (Q2b)
In this section we experimentally compare the running time of the subgraph matching and the subgraph (or graphlet) kernel as discussed in Sect. 5.4. The explicit computation scheme, which is applicable to graphs with discrete labels compared by the Dirac kernel, is expected to be favorable.
6.5.1 Method
We have reimplemented a variation of the graphlet kernel taking connected induced subgraphs with three vertices and discrete vertex and edge labels into account. The only possible features are triangles and paths of length two. Graph canonization is realized by selecting the lexicographically smallest string obtained by traversing the graph and concatenating the observed labels. Our implementation is similar to the approach used by Shervashidze et al. (2011) as an extension of the original graphlet kernel (Shervashidze et al. 2009) to the domain of labeled graphs. We refer to this method as the graphlet kernel in the following. We compared the graphlet kernel to the connected subgraph matching kernel taking only connected subgraphs on three vertices into account. In order not to penalize the running time of the connected subgraph matching kernel by additional automorphism computations, the weight function does not consider the number of automorphisms (Kriege and Mutzel 2012, Theorem 2); consequently, the two methods do not compute the same kernel values.
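For subgraphs on three vertices, canonization by the lexicographically smallest label string can be realized by brute force over all vertex orderings. The following sketch illustrates the idea under an assumed data layout (names and encoding are ours, not the paper's code):

```python
from itertools import combinations, permutations

def canonical_form(vl, el):
    """Smallest label string over all vertex orderings of a small subgraph.

    vl -- vertex labels, vl[v]
    el -- edge labels keyed by frozenset({u, v}); missing pairs are non-edges
    Brute force is feasible here because only 3! = 6 orderings exist.
    """
    best = None
    for perm in permutations(vl):
        s = [vl[v] for v in perm]              # vertex labels in this order
        for u, v in combinations(perm, 2):     # then labels of all vertex pairs
            s.append(el.get(frozenset((u, v)), '#'))  # '#' marks a non-edge
        if best is None or s < best:
            best = s
    return tuple(best)
```

Isomorphic labeled subgraphs receive the same canonical form regardless of vertex numbering, so the forms can serve directly as feature keys; a triangle and a path of length two are distinguished even with uniform labels.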
The experiments were conducted using Sun Java JDK v1.6.0 on an Intel Xeon E5430 machine at 2.66 GHz with 8 GB of RAM using a single processor only. The reported running times are average values over 5 runs.
6.5.2 Results and discussion
Figure 7 shows a computational phase transition: for the synthetic data set the subgraph matching kernel is more efficient than the graphlet kernel for instances with 20–30 different labels, and its running time increases drastically when the number of labels decreases. The graphlet kernel, in contrast, is more efficient for graphs with uniform or few labels. For more than 10 different labels there is only a moderate increase in running time. This can be explained by the fact that the number of features contained in the graphs does not increase considerably once a certain number of different labels is reached. The enumeration of triangles dominates the running time for this relatively dense synthetic data set. The running time behavior of the subgraph matching kernel is as expected and is directly related to the size and number of edges of the weighted association graph.
Our synthetic data set differs from typical real-world instances, since we generated dense graphs with many different labels, which are assigned uniformly at random. For real-world data sets the graphlet kernel consistently outperforms the subgraph matching kernel by orders of magnitude. It would be interesting to further investigate where this computational phase transition occurs for larger subgraphs and to analyze whether the implicit computation scheme then becomes competitive for instances of practical relevance. This requires the implementation of non-trivial graph canonization algorithms and remains future work. The results we obtained clearly suggest preferring the explicit computation schemes when no flexible scoring by vertex and edge kernels is required.
7 Conclusion
The breadth of problems that require dealing with graph data is growing rapidly, and graph kernels have become an efficient and widely used method for measuring similarity between graphs. Highly scalable graph kernels have recently been proposed for graphs with thousands or even millions of vertices based on explicit graph feature maps. Implicit computation schemes are used for kernels with a large number of possible features, such as walk-based kernels, and when graphs are annotated with continuous attributes.
To set the stage for the experimental comparison, we made several contributions to the theory and algorithmics of graph kernels. We presented a unified view on implicit and explicit graph feature maps. More precisely, we derived explicit feature maps from the implicit feature space of convolution kernels and analyzed the circumstances rendering this approach feasible in practice. Using these results, we developed explicit computation schemes for random walk kernels (Gärtner et al. 2003; Vishwanathan et al. 2010), subgraph matching kernels (Kriege and Mutzel 2012), and shortest-path kernels (Borgwardt and Kriegel 2005). Moreover, we introduced weighted vertex kernels and derived their explicit feature maps. As a result, we obtained approximate feature maps for state-of-the-art kernels for graphs with continuous attributes, such as the GraphHopper kernel (Feragen et al. 2013). For fixed length walk kernels we developed implicit and explicit computation schemes and analyzed their running time. Our theoretical results have been confirmed experimentally by observing a computational phase transition with respect to label diversity and walk lengths.
We have shown that kernels composed by multiplication of non-trivial base kernels may lead to a rapid growth of the number of non-zero components in the feature vectors, which renders explicit computation infeasible. One approach to alleviate this in future work is to introduce sampling or hashing to obtain compact feature representations in such cases, e.g., following the work by Shi et al. (2009).
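A hashed feature map in the spirit of Shi et al. (2009) might be sketched as follows; the helper names, the hash function, and the dimension 1024 are arbitrary illustrative choices, not a prescription:

```python
import zlib

def hashed_features(feats, dim=1024):
    """Fold a sparse feature map (feature -> count) into a fixed-size vector.

    Collisions introduce a small approximation error; the random sign keeps
    the inner-product estimate unbiased (signed feature hashing).
    """
    vec = [0.0] * dim
    for feature, count in feats.items():
        h = zlib.crc32(repr(feature).encode())     # deterministic hash
        sign = 1.0 if h & 1 else -1.0              # one bit decides the sign
        vec[(h >> 1) % dim] += sign * count        # remaining bits pick the bin
    return vec

def hashed_kernel(f1, f2, dim=1024):
    """Approximate the kernel value as a dot product of hashed vectors."""
    v1, v2 = hashed_features(f1, dim), hashed_features(f2, dim)
    return sum(a * b for a, b in zip(v1, v2))
```

The memory footprint is then bounded by the chosen dimension regardless of how many distinct features the composed kernel generates, trading exactness for a controllable approximation error.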
Footnotes
1. Note that we may consider every kernel \(\kappa _i\) on \(\mathcal {R}_i\) as a kernel \(\kappa '_i\) on \(\mathcal {R}\) by defining \(\kappa '_i(x,y)=\kappa _i(x_i,y_i)\).
2. The same idea to compare walks was proposed by Kashima et al. (2003) as part of the marginalized kernel between labeled graphs.
3. Note that we consider undirected graphs, while Gärtner et al. (2003) consider directed graphs.
4. We assume \(r_i(u,u')=0\) for \((u,u') \notin V(G \times _{w}H)\).
5.
6. We used the updated version of the data set Synthetic published together with the Erratum to Feragen et al. (2013).
References
Andoni A (2009) Nearest neighbor search: the old, the new, and the impossible. Ph.D. thesis, MIT
Bai L, Rossi L, Zhang Z, Hancock ER (2015) An aligned subtree kernel for weighted graphs. In: Proceedings of the thirty-second international conference on machine learning, pp 30–39
Borgwardt KM, Kriegel HP (2005) Shortest-path kernels on graphs. In: Proceedings of the fifth IEEE international conference on data mining, pp 74–81
Borgwardt KM, Ong CS, Schönauer S, Vishwanathan SVN, Smola AJ, Kriegel HP (2005) Protein function prediction via graph kernels. Bioinformatics 21(Suppl 1):i47–i56
Bruna J, Zaremba W, Szlam A, LeCun Y (2014) Spectral networks and deep locally connected networks on graphs. In: International conference on learning representations
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems, pp 3844–3852
Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, Adams RP (2015) Convolutional networks on graphs for learning molecular fingerprints. In: Advances in neural information processing systems, pp 2224–2232
Feragen A, Kasenburg N, Petersen J, Bruijne MD, Borgwardt K (2013) Scalable kernels for graphs with continuous attributes. In: Burges C, Bottou L, Welling M, Ghahramani Z, Weinberger K (eds) Advances in neural information processing systems, pp 216–224. Erratum available at http://image.diku.dk/aasa/papers/graphkernels_nips_erratum.pdf
Fout A, Byrd J, Shariat B, Ben-Hur A (2017) Protein interface prediction using graph convolutional networks. In: Advances in neural information processing systems, pp 6533–6542
Fröhlich H, Wegner JK, Sieker F, Zell A (2005) Optimal assignment kernels for attributed molecular graphs. In: Proceedings of the 22nd international conference on machine learning. ACM, New York, NY, USA, ICML '05, pp 225–232
Gärtner T, Flach P, Wrobel S (2003) On graph kernels: hardness results and efficient alternatives. In: Learning theory and kernel machines, Lecture Notes in Computer Science, vol 2777. Springer, pp 129–143
Ghosh S, Das N, Gonçalves T, Quaresma P, Kundu M (2018) The journey of graph kernels through two decades. Comput Sci Rev 27:88–111
Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: 33rd International conference on machine learning
Hamilton WL, Ying R, Leskovec J (2017a) Inductive representation learning on large graphs. In: Advances in neural information processing systems, pp 1025–1035
Hamilton WL, Ying R, Leskovec J (2017b) Representation learning on graphs: methods and applications. IEEE Data Eng Bull 40(3):52–74
Harchaoui Z, Bach F (2007) Image classification with segmentation graph kernels. In: IEEE conference on computer vision and pattern recognition
Haussler D (1999) Convolution kernels on discrete structures. Tech. Rep. UCSC-CRL-99-10, University of California, Santa Cruz, CA, USA
Hido S, Kashima H (2009) A linear-time graph kernel. In: The ninth IEEE international conference on data mining, pp 179–188
Horváth T, Gärtner T, Wrobel S (2004) Cyclic pattern kernels for predictive graph mining. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, pp 158–167
Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, pp 217–226
Johansson FD, Dubhashi D (2015) Learning with similarity functions on graphs using matchings of geometric embeddings. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, KDD '15, pp 467–476
Kang U, Tong H, Sun J (2012) Fast random walk graph kernel. In: Proceedings of the 2012 SIAM international conference on data mining, pp 828–838
Kar P, Karnick H (2012) Random feature maps for dot product kernels. In: Proceedings of the fifteenth international conference on artificial intelligence and statistics, AISTATS 2012, La Palma, Canary Islands, April 21–23, 2012, pp 583–591
Kashima H, Tsuda K, Inokuchi A (2003) Marginalized kernels between labeled graphs. In: Proceedings of the twentieth international conference on machine learning, pp 321–328
Kersting K, Kriege NM, Morris C, Mutzel P, Neumann M (2016) Benchmark data sets for graph kernels. http://graphkernels.cs.tu-dortmund.de
Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: International conference on learning representations
Kondor R, Pan H (2016) The multiscale Laplacian graph kernel. In: Advances in neural information processing systems, pp 2982–2990
Kriege N, Mutzel P (2012) Subgraph matching kernels for attributed graphs. In: Proceedings of the 29th international conference on machine learning. http://www.icml.cc/Omnipress
Kriege N, Neumann M, Kersting K, Mutzel P (2014) Explicit versus implicit graph feature maps: a computational phase transition for walk kernels. In: 2014 IEEE international conference on data mining, pp 881–886
Kriege NM, Giscard PL, Wilson R (2016) On valid optimal assignment kernels and applications to graph classification. In: Advances in neural information processing systems. Curran Associates, Inc., pp 1623–1631
Kriege NM, Johansson FD, Morris C (2019) A survey on graph kernels. CoRR. arXiv:1903.11835
Mahé P, Ueda N, Akutsu T, Perret JL, Vert JP (2004) Extensions of marginalized graph kernels. In: Proceedings of the twenty-first international conference on machine learning, p 70
Martino GDS, Navarin N, Sperduti A (2012) A tree-based kernel for graphs. In: Proceedings of the 2012 SIAM international conference on data mining. SIAM/Omnipress, pp 975–986
Martino GDS, Navarin N, Sperduti A (2018) Tree-based kernel for graphs with continuous attributes. IEEE Trans Neural Netw Learn Syst 29(7):3270–3276
Merkwirth C, Lengauer T (2005) Automatic generation of complementary descriptors with molecular graph networks. J Chem Inf Model 45(5):1159–1168
Morris C, Kriege NM, Kersting K, Mutzel P (2016) Faster kernels for graphs with continuous attributes via hashing. In: Bonchi F, Domingo-Ferrer J (eds) IEEE international conference on data mining (ICDM)
Narayanan A, Chandramohan M, Chen L, Liu Y, Saminathan S (2016) subgraph2vec: learning distributed representations of rooted subgraphs from large graphs. In: Workshop on mining and learning with graphs. arXiv:1606.08928
Neumann M, Garnett R, Bauckhage C, Kersting K (2016) Propagation kernels: efficient graph kernels from propagated information. Mach Learn 102(2):209–245
Nikolentzos G, Meladianos P, Vazirgiannis M (2017) Matching node embeddings for graph similarity. In: AAAI. AAAI Press, pp 2429–2435
Nikolentzos G, Meladianos P, Limnios S, Vazirgiannis M (2018) A degeneracy framework for graph similarity. In: IJCAI, pp 2595–2601. http://www.ijcai.org
Orsini F, Frasconi P, De Raedt L (2015) Graph invariant kernels. In: Proceedings of the twenty-fourth international joint conference on artificial intelligence, pp 3756–3762
Pham N, Pagh R (2013) Fast and scalable polynomial kernels via explicit feature maps. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, KDD '13, pp 239–247. https://doi.org/10.1145/2487575.2487591
Rahimi A, Recht B (2008) Random features for large-scale kernel machines. In: Advances in neural information processing systems, pp 1177–1184
Ramon J, Gärtner T (2003) Expressivity versus efficiency of graph kernels. In: First international workshop on mining graphs, trees and sequences
Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model. IEEE Trans Neural Netw 20(1):61–80
Schiavinato M, Gasparetto A, Torsello A (2015) Transitive assignment kernels for structural classification. In: Feragen A, Pelillo M, Loog M (eds) Similarity-based pattern recognition: third international workshop, SIMBAD 2015, Copenhagen, Denmark, October 12–14, 2015. Springer International Publishing, Cham, pp 146–159
Schütt K, Kindermans PJ, Sauceda HE, Chmiela S, Tkatchenko A, Müller KR (2017) SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In: Advances in neural information processing systems, pp 992–1002
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, New York
Shervashidze N, Borgwardt K (2009) Fast subtree kernels on graphs. In: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A (eds) Advances in neural information processing systems, vol 22, pp 1660–1668
Shervashidze N, Vishwanathan S, Petri TH, Mehlhorn K, Borgwardt KM (2009) Efficient graphlet kernels for large graph comparison. In: 12th International conference on artificial intelligence and statistics
Shervashidze N, Schweitzer P, van Leeuwen EJ, Mehlhorn K, Borgwardt KM (2011) Weisfeiler-Lehman graph kernels. J Mach Learn Res 12:2539–2561
Shi Q, Petterson J, Dror G, Langford J, Smola A, Vishwanathan S (2009) Hash kernels for structured data. J Mach Learn Res 10:2615–2637
Shin K, Kuboyama T (2010) A generalization of Haussler's convolution kernel—mapping kernel and its application to tree kernels. J Comput Sci Technol 25:1040–1054
Su Y, Han F, Harang RE, Yan X (2016) A fast kernel for attributed graphs. In: Proceedings of the 2016 SIAM international conference on data mining
Sugiyama M, Borgwardt KM (2015) Halting in random walk kernels. In: Advances in neural information processing systems, pp 1630–1638
Swamidass SJ, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P (2005) Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics 21(Suppl 1):i359–i368. https://doi.org/10.1093/bioinformatics/bti1055
Vedaldi A, Zisserman A (2012) Efficient additive kernels via explicit feature maps. IEEE Trans Pattern Anal Mach Intell 34(3):480–492
Vert JP (2008) The optimal assignment kernel is not positive definite. CoRR abs/0801.4061
Vishwanathan SVN, Schraudolph NN, Kondor RI, Borgwardt KM (2010) Graph kernels. J Mach Learn Res 11:1201–1242
Wale N, Watson IA, Karypis G (2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl Inf Syst 14(3):347–375
Yanardag P, Vishwanathan SVN (2015) Deep graph kernels. In: 21st ACM SIGKDD international conference on knowledge discovery and data mining, pp 1365–1374
Ying R, He R, Chen K, Eksombatchai P, Hamilton WL, Leskovec J (2018a) Graph convolutional neural networks for web-scale recommender systems. In: ACM SIGKDD international conference on knowledge discovery & data mining
Ying R, You J, Morris C, Ren X, Hamilton WL, Leskovec J (2018b) Hierarchical graph representation learning with differentiable pooling. In: Advances in neural information processing systems
Zhang M, Cui Z, Neumann M, Chen Y (2018a) An end-to-end deep learning architecture for graph classification. In: AAAI conference on artificial intelligence
Zhang Y, Wang L, Wang L (2018b) A comprehensive evaluation of graph kernels for unattributed graphs. Entropy 20(12):984
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.