Abstract
Graphs are versatile tools for representing structured data. As a result, a variety of machine learning methods have been studied for graph data analysis. Although many such learning methods depend on the measurement of differences between input graphs, defining an appropriate distance metric for graphs remains a controversial issue. Hence, we propose a supervised distance metric learning method for the graph classification problem. Our method, named interpretable graph metric learning (IGML), learns discriminative metrics in a subgraph-based feature space, which has a strong graph representation capability. By introducing a sparsity-inducing penalty on the weight of each subgraph, IGML can identify a small number of important subgraphs that can provide insight into the given classification task. Because our formulation has a large number of optimization variables, an efficient algorithm that uses pruning techniques based on safe screening and working set selection methods is also proposed. An important property of IGML is that solution optimality is guaranteed because the problem is formulated as a convex problem and our pruning strategies only discard unnecessary subgraphs. Furthermore, we show that IGML is also applicable to other structured data such as itemset and sequence data, and that it can incorporate vertex-label similarity by using a transportation-based subgraph feature. We empirically evaluate the computational efficiency and classification performance of IGML on several benchmark datasets and provide some illustrative examples of how IGML identifies important subgraphs from a given graph dataset.
Introduction
Because of the growing diversity of data science applications, machine learning methods must adapt to a variety of complicated structured data, from which it is often difficult to obtain typical numerical vector representations of input objects. A standard approach to modeling structured data is to employ graphs. For example, graph-based representations are prevalent in domains such as chemo- and bio-informatics. In this study, we particularly focus on the case in which a data instance is represented as a pair of a graph and its associated class label.
Although numerous machine learning methods explicitly or implicitly depend on how to measure differences between input objects, defining an appropriate distance metric on graphs remains a controversial issue in the field. A widely accepted approach is the graph kernel (Gärtner et al. 2003; Vishwanathan et al. 2010), which enables machine learning methods to be applied to graph data without requiring explicit vector representations. Another popular approach is to use neural networks (Atwood and Towsley 2016; Narayanan et al. 2017), from which a suitable representation can be learned without explicitly defining a metric. However, in these approaches, it is difficult to create a metric that explicitly extracts significant substructures, i.e., subgraphs. Identifying discriminative subgraphs in an interpretable manner can be insightful for many graph classification tasks. In particular, graph representation is prevalent in scientific data analysis. For example, chemical compounds are often represented by graphs; thus, finding subgraphs that have a strong effect on a target label (e.g., toxicity) is informative. Other examples of graph representations are protein 3D structures and crystalline substances (e.g., Brinda and Vishveshwara 2005; Xie and Grossman 2018), where the automatic identification of important substructures is expected to provide insight into the correlation between structures and target labels. Further details of the previous studies are discussed in Sect. 2.
We propose a supervised method that obtains a metric for graphs, thereby achieving both high predictive performance and interpretability. Our method, named interpretable graph metric learning (IGML), combines the concept of metric learning (e.g., Weinberger and Saul 2009; Davis et al. 2007) with a subgraph representation, where each graph is represented by a set of its subgraphs. IGML optimizes a metric that assigns a weight \(m_{i(H)} \ge 0\) to each subgraph H contained in a given graph G. Let \(\phi _H(G)\) be a feature of the graph G that is monotonically nondecreasing with respect to the frequency of subgraph H of G.
Note that we assume throughout the study that subgraphs are counted without overlapping vertices or edges. We consider the following squared distance between two graphs G and \(G'\):
\[ d_{\varvec{m}}(G, G') := \sum_{H \in {\mathcal{G}}} m_{i(H)} \left( \phi_H(G) - \phi_H(G') \right)^2, \tag{1} \]
where \({{\mathcal {G}}}\) is the set of all connected graphs. Although it is known that the subgraph approach has strong graph representation capability (e.g., Gärtner et al. 2003), naïve calculation is obviously infeasible unless the weight parameters have some special structure.
We formulate IGML as a supervised learning problem of the distance function (1) using a pairwise loss function of metric learning (Davis et al. 2007) with a sparse penalty on \(m_{i(H)}\). The resulting optimization problem appears computationally infeasible at first glance, because the number of weight parameters is equal to the number of possible subgraphs, which is usually intractable. We overcome this difficulty by introducing safe screening (Ghaoui et al. 2010) and working set selection (Fan et al. 2008) approaches. Both of these approaches can significantly reduce the number of variables, and further, they can be combined with a pruning strategy on the tree traversal of graph mining. These optimization tricks are inspired by two recent studies, Nakagawa et al. (2016) and Morvan and Vert (2018), which developed safe-screening-based and working-set-based pruning, respectively, for a linear prediction model with the LASSO penalty. By combining these two techniques, we constructed a pathwise optimization method that can obtain a sparse solution of the weight parameter \(m_{i(H)}\) without directly enumerating all possible subgraphs.
To the best of our knowledge, no previous studies can provide an interpretable subgraph-based metric learned in a supervised manner. The advantages of IGML can be summarized as follows:

Because IGML is formulated as a convex optimization problem, the global optimum can be found by standard gradient-based optimization.

The safe screening and working-set-based optimization algorithms make our problem practically tractable without sacrificing optimality.

We can identify a small number of important subgraphs that discriminate different classes. This implies that the resulting metric is easy to compute and highly interpretable, making it useful for a variety of subsequent data analyses. For example, applying nearest-neighbor classification or a decision tree on the learned space would be effective.
Moreover, we propose three extensions of IGML. First, we show that IGML is directly applicable to other structured data, such as itemset and sequence data. Second, its application to a triplet-based loss function is discussed. Third, we extend IGML to allow similarity information of vertex labels to be incorporated. We empirically verify that the prediction performance of IGML is superior or comparable to that of other existing graph classification methods (most of which are not interpretable). We also show some examples of extracted subgraphs and data analyses on the learned metric space.
The remainder of this paper is organized as follows. In Sect. 2, we review previous studies on graph data analysis. In Sect. 3, we introduce a formulation of our proposed IGML. Section 4 discusses strategies to reduce the size of the IGML optimization problem. The detailed computational procedure of IGML is described in Sect. 5. Three extensions of IGML are presented in Sect. 6. Section 7 reports our empirical evaluation of the effectiveness of IGML on several benchmark datasets.
Note that this paper is an extended version of a preliminary conference paper (Yoshida et al. 2019a). The source code of the program used in our experiments is available at https://github.com/takeuchilab/LearningInterpretableMetricbetweenGraphs.
Related work
Kernel-based approaches have been widely studied for graph data analysis, and they can provide a metric of graph data in a reproducing kernel Hilbert space. In particular, subgraph-based graph kernels are closely related to our study. The graphlet kernel (Shervashidze et al. 2009) creates a kernel through small subgraphs with only about 3–5 vertices, which are called graphlets. The neighborhood subgraph pairwise distance kernel (Costa and Grave 2010) selects pairs of subgraphs from a graph and counts the number of pairs identical to those in another graph. The subgraph matching kernel (Kriege and Mutzel 2012) identifies common subgraphs based on cliques in the product graph of two graphs. The feature space created by these subgraph-based kernels is easy to interpret. However, because the above approaches are unsupervised, it is fundamentally impossible to eliminate subgraphs that are unnecessary for a specific target classification task. Therefore, for example, to create the entire kernel matrix of training data, all the candidate subgraphs in the data must be enumerated once, which becomes intractable even for small-sized subgraphs. In contrast, we consider dynamically “pruning” unnecessary subgraphs through a supervised formulation of metric learning. As we will demonstrate in our later experimental results, this significantly reduces the enumeration cost, allowing our proposed algorithm to handle larger subgraphs than the simple subgraph-based kernels.
There are many other kernels including the shortest path (Borgwardt and Kriegel 2005), random walk (Vishwanathan et al. 2010; Sugiyama and Borgwardt 2015; Zhang et al. 2018b), and spectrum-based (Kondor and Borgwardt 2008; Kondor et al. 2009; Kondor and Pan 2016; Verma and Zhang 2017) approaches. The Weisfeiler–Lehman (WL) kernel (Shervashidze and Borgwardt 2009; Shervashidze et al. 2011), which is based on the graph isomorphism test, is a popular and empirically successful kernel that has been employed in many studies (Yanardag and Vishwanathan 2015; Niepert et al. 2016; Narayanan et al. 2017; Zhang et al. 2018a). Again, all such approaches are unsupervised, and it is difficult to interpret results from the perspective of substructures of a graph. Although several kernels deal with continuous attributes on vertices (Feragen et al. 2013; Orsini et al. 2015; Su et al. 2016; Morris et al. 2016), we only focus on the cases where vertex labels are discrete due to the associated interpretability.
Because obtaining a good metric is an essential task in data analysis, metric learning has been extensively studied to date, as reviewed by Li and Tian (2018). However, due to its computational difficulty, metric learning for graph data has not been widely studied. A few studies have considered edit distance approaches. For example, Bellet et al. (2012) presented a method for learning a similarity function through an edit distance in a supervised manner. Another approach probabilistically formulates the editing process of the graph and estimates the parameters using labeled data (Neuhaus and Bunke 2007). However, these approaches cannot provide any clear interpretation of the resulting metric in terms of the subgraphs.
Likewise, the deep neural network (DNN) is a standard approach to graph data analysis. The deep graph kernel (Yanardag and Vishwanathan 2015) incorporates neural language modeling, where decomposed substructures of a graph are regarded as sentences. PATCHY-SAN (Niepert et al. 2016) and DGCNN (Zhang et al. 2018a) convert a graph to a tensor by using the WL kernel and then apply convolution to it. Several other studies have also combined popular convolution techniques with graph data (Tixier et al. 2018; Atwood and Towsley 2016; Simonovsky and Komodakis 2017). These approaches are supervised, but the interpretability of these DNNs is relatively low. Attention enhances the interpretability of deep learning, but extracting important subgraphs is difficult because attention algorithms for graphs (Lee et al. 2018) only provide the significance of vertex transitions on a graph. Another related DNN approach is representation learning. For example, sub2vec (Adhikari et al. 2018) and graph2vec (Narayanan et al. 2017) can embed graph data into a continuous space, but they are unsupervised, and it is difficult to extract substructures that characterize different classes. There are other fingerprint learning methods for graphs based on neural networks (e.g., Duvenaud et al. 2015), where the contribution from each node can be evaluated for each dimension of the fingerprint. Although it is possible to highlight substructures for the given input graph, this does not produce important common subgraphs for prediction.
Supervised pattern mining (Cheng et al. 2008; Novak et al. 2009; Thoma et al. 2010) can be used for identifying important subgraphs by enumerating patterns with some discriminative score. However, these approaches usually 1) employ a greedy strategy to add a pattern for which global optimality cannot be guaranteed, and 2) do not optimize a metric or representation. A few other studies (Saigo et al. 2009; Nakagawa et al. 2016) have considered optimizing a linear model on the subgraph features with the LASSO penalty using graph mining. A common idea of these two methods is to traverse a graph mining tree with pruning strategies derived from optimality conditions. Saigo et al. (2009) employed a boosting-based approach, which adds a subgraph that violates the optimality condition most severely at every iteration. It was shown that the maximum violation condition can be efficiently identified by pruning the tree without losing the final solution optimality. Nakagawa et al. (2016) derived a pruning criterion by extending safe screening (Ghaoui et al. 2010), which can safely eliminate unnecessary features before solving the optimization problem. This approach can also avoid enumerating the entire tree while guaranteeing the optimality, and its efficiency compared with the boosting-based approach was demonstrated empirically, mainly because it requires far fewer tree traversals. Further, Morvan and Vert (2018) proposed a similar pruning extension of working set selection for optimizing a higher-order interaction model. Although that paper did not address graph data, the technique is applicable to the same subgraph-based linear model as in Saigo et al. (2009) and Nakagawa et al. (2016). Working set selection is a heuristic feature subset selection strategy that has been widely used in machine learning algorithms, such as support vector machines (e.g., Hsu and Lin 2002). 
Unlike safe screening, this heuristic selection may eliminate necessary features in the middle of the optimization, but the optimality of the final solution can be guaranteed by iterating subset selection repeatedly until the solution converges. However, these methods can only optimize a linear prediction model. In this study, we focus on metric learning of graphs. Therefore, unlike the above-mentioned pruning-based learning methods, our aim is to learn a “distance function”. In metric learning, a distance function is typically learned from a loss function defined over a relative relation between samples (usually, pairs or triplets), by which a discriminative feature space that is generally effective for subsequent tasks, such as classification and similarity-based retrieval, is obtained. Inspired by Nakagawa et al. (2016) and Morvan and Vert (2018), we derive screening and pruning rules for this setting, and further, we combine them to develop an efficient algorithm.
Formulation of interpretable graph metric learning
Optimization problem
Suppose that the training dataset \(\{(G_i,y_i)\}_{i \in [n]}\) consists of n pairs of a graph \(G_i\) and a class label \(y_i\), where \([n] :=\{1, \ldots , n\}\). Let \({{\mathcal {G}}}\) be the set of all connected subgraphs of \(\{ G_i \}_{i \in [n]}\). In each graph, vertices and edges can be labeled. If \(H \in {{\mathcal {G}}}\) is a connected subgraph of \(G \in {{\mathcal {G}}}\), we write \(H \sqsubseteq G\). Further, let \(\#(H \sqsubseteq G)\) be the frequency of the subgraph H in G. Note that we adopt a definition of frequency that does not allow any vertices or edges among the counted subgraphs to overlap. As a representation of a graph G, we consider the following subgraph-based feature representation:
\[ \phi_H(G) = g\left( \#(H \sqsubseteq G) \right) \quad \text{for } H \in {\mathcal{G}}, \tag{2} \]
where g is some monotonically nondecreasing and nonnegative function, such as the identity function \(g(x) = x\) or indicator function \(g(x) = 1_{x>0}\), which takes the value 1 if \(x > 0\), and 0 otherwise. It is widely known that subgraph-based features can effectively represent graphs. For example, \(g(x) = x\) allows all nonisomorphic graphs to be distinguished. A similar idea was shown in (Gärtner et al. 2003) for a frequency that allows overlaps. However, this feature space is practically infeasible because the possible number of subgraphs is prohibitively large.
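As a small illustration of the feature map described above, the following Python sketch applies the two example choices of g to a table of subgraph frequencies. The subgraph keys and counts are made up for illustration, and the function names are hypothetical; in practice the frequencies would come from a graph mining procedure.

```python
def subgraph_features(freq, g):
    """Map each subgraph key H to phi_H(G) = g(#(H in G))."""
    return {H: g(count) for H, count in freq.items()}

def identity(x):
    """g(x) = x: the feature is the (non-overlapping) frequency itself."""
    return x

def indicator(x):
    """g(x) = 1 if x > 0 else 0: the feature records presence only."""
    return 1 if x > 0 else 0

# Hypothetical non-overlapping frequencies of three subgraphs in one graph G.
freq = {"C-C": 3, "C-O": 1, "C-N": 0}
phi_count = subgraph_features(freq, identity)
phi_binary = subgraph_features(freq, indicator)
```

Both choices of g are monotonically nondecreasing and nonnegative, which is the only property the later pruning arguments rely on.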
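We focus on how to measure the distance between two graphs, which is essential for a variety of machine learning problems. We consider the following weighted squared distance between two graphs:
\[ d_{\varvec{m}}(G, G') := \sum_{H \in {\mathcal{G}}} m_{i(H)} \left( \phi_H(G) - \phi_H(G') \right)^2, \]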
where i(H) is the index of the subgraph H for a weight parameter \(m_{i(H)} \ge 0\). To obtain an effective and computable distance metric, we adaptively estimate \(m_{i(H)}\) such that only a small number of important subgraphs have nonzero \(m_{i(H)}\) values.
Let \(\varvec{x}_i \in {\mathbb {R}}^p\) be the feature vector defined by concatenating \(\phi _H(G_i)\) for all \(H \in {{\mathcal {G}}}\) included in the training dataset. Then, we have
\[ d_{\varvec{m}}(\varvec{x}_i, \varvec{x}_j) = (\varvec{x}_i - \varvec{x}_j)^\top \mathrm{diag}(\varvec{m}) (\varvec{x}_i - \varvec{x}_j) = \varvec{m}^\top \varvec{c}_{ij}, \]
where \({\varvec{m}}\in {\mathbb {R}}_+^p\) is a vector of \(m_{i(H)}\), and \({\varvec{c}}_{ij} \in {\mathbb {R}}^p\) is defined as \({\varvec{c}}_{ij}:=({\varvec{x}}_i - {\varvec{x}}_j)\circ ({\varvec{x}}_i - {\varvec{x}}_j)\) with the element-wise product \(\circ \).
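To make the vector form of the distance concrete, the following minimal Python sketch computes the element-wise squared difference and its weighted sum for two toy feature vectors. The function names and the numerical values are illustrative assumptions only.

```python
def pair_feature(x_i, x_j):
    """c_ij: element-wise squared difference of two feature vectors."""
    return [(a - b) ** 2 for a, b in zip(x_i, x_j)]

def weighted_sq_distance(m, x_i, x_j):
    """Weighted squared distance m^T c_ij for non-negative weights m."""
    return sum(w * c for w, c in zip(m, pair_feature(x_i, x_j)))

# Toy example with three hypothetical subgraph features (indicator values).
x1 = [1, 0, 1]
x2 = [1, 1, 0]
m = [0.5, 2.0, 0.0]  # the third subgraph has zero weight and is ignored
d = weighted_sq_distance(m, x1, x2)
```

Because the distance is linear in the weight vector, a sparse weight vector means that only the few subgraphs with nonzero weight need to be evaluated at prediction time.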
Let \({\mathcal {S}}_i \subseteq [n]\) and \({\mathcal {D}}_i \subseteq [n]\) be the subsets of indices that are in the same class as \(\varvec{x}_i\) and in different classes from \(\varvec{x}_i\), respectively. For each of these sets, we select the K most similar inputs to \(\varvec{x}_i\) by using some default metric, such as the graph kernel (further details are presented in Sect. 3.2). As a loss function for \(\varvec{x}_i\), we consider
\[ \sum_{l \in {\mathcal{D}}_i} \ell_L(\varvec{m}^\top \varvec{c}_{il}) + \sum_{j \in {\mathcal{S}}_i} \ell_{-U}(-\varvec{m}^\top \varvec{c}_{ij}), \]
where \(L, U \in {\mathbb {R}}_+\) are constant parameters satisfying \(U \le L\), and \(\ell _t(x) = [t - x]_+^2\) is the standard squared hinge loss function with threshold \(t \in {\mathbb {R}}\). This loss function is a variant of the pairwise loss functions used in metric learning (Davis et al. 2007). The first term in the loss function yields a penalty if \(\varvec{x}_i\) and \(\varvec{x}_l\) are closer than L for \(l \in {{\mathcal {D}}}_i\), and the second term yields a penalty if \(\varvec{x}_i\) and \(\varvec{x}_j\) are more distant than U for \(j \in {{\mathcal {S}}}_i\).
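The behavior of the two terms can be checked with a short Python sketch. It assumes the squared hinge loss takes the form \([t - x]_+^2\) and that the same-class term is evaluated with threshold \(-U\) on the negated distance, which matches the verbal description above; the function names and the numbers are hypothetical.

```python
def sq_hinge(t, x):
    """Squared hinge loss with threshold t: max(t - x, 0)^2."""
    return max(t - x, 0.0) ** 2

def pairwise_loss(d_diff, d_same, L, U):
    """Loss for one anchor sample.

    d_diff: distances to the selected different-class samples (D_i).
    d_same: distances to the selected same-class samples (S_i).
    """
    push = sum(sq_hinge(L, d) for d in d_diff)    # penalized when d < L
    pull = sum(sq_hinge(-U, -d) for d in d_same)  # penalized when d > U
    return push + pull

# Toy distances with U <= L.
loss = pairwise_loss(d_diff=[0.5, 2.0], d_same=[0.25, 1.5], L=1.0, U=0.5)
```

In this toy case only the different-class pair at distance 0.5 and the same-class pair at distance 1.5 contribute to the loss.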
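Let \(R({\varvec{m}})=\Vert {\varvec{m}}\Vert _1+\frac{\eta }{2}\Vert {\varvec{m}}\Vert _2^2={\varvec{m}}^\top \varvec{1}+\frac{\eta }{2}\Vert {\varvec{m}}\Vert _2^2\) be an elastic-net-type sparsity-inducing penalty, where \(\eta \ge 0\) is a nonnegative parameter. We define our proposed IGML (interpretable graph metric learning) as the following regularized loss minimization problem:
\[ \min_{\varvec{m} \ge \varvec{0}} P_\lambda(\varvec{m}) := \sum_{i \in [n]} \Bigl[ \sum_{l \in {\mathcal{D}}_i} \ell_L(\varvec{m}^\top \varvec{c}_{il}) + \sum_{j \in {\mathcal{S}}_i} \ell_{-U}(-\varvec{m}^\top \varvec{c}_{ij}) \Bigr] + \lambda R(\varvec{m}), \tag{4} \]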
where \(\lambda > 0\) is the regularization parameter. The solution of this problem can provide not only a discriminative metric but also insight into important subgraphs because the sparse penalty is expected to select only a small number of nonzero parameters.
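Let \(\varvec{\alpha }\in {\mathbb {R}}^{2nK}_+\) be the vector of dual variables where \(\alpha _{il}\) and \(\alpha _{ij}\) for \(i \in [n], l \in {{\mathcal {D}}}_i\), and \(j \in {{\mathcal {S}}}_i\) are concatenated. The dual problem of (4) is written as follows (see Appendix A for derivation):
\[ \max_{\varvec{\alpha} \ge \varvec{0}} D_\lambda(\varvec{\alpha}) := -\frac{1}{4}\Vert \varvec{\alpha} \Vert_2^2 + \varvec{t}^\top \varvec{\alpha} - \frac{\lambda \eta}{2} \Vert \varvec{m}_\lambda(\varvec{\alpha}) \Vert_2^2, \tag{5} \]
where
\[ \varvec{m}_\lambda(\varvec{\alpha}) := \frac{1}{\lambda \eta} [\varvec{C}\varvec{\alpha} - \lambda \varvec{1}]_+, \tag{6} \]
\({\varvec{t}}:=[L,\ldots ,L,-U,\ldots ,-U]^\top \in {\mathbb {R}}^{2nK}\) and \({\varvec{C}}:=[\ldots ,{\varvec{c}}_{il},\ldots ,-{\varvec{c}}_{ij}, \ldots ] \in {\mathbb {R}}^{p\times 2nK}\). Then, from the optimality condition, we obtain the following relationship between the primal and dual variables:
\[ \alpha_{il} = -\ell_L'(\varvec{m}^\top \varvec{c}_{il}), \quad \alpha_{ij} = -\ell_{-U}'(-\varvec{m}^\top \varvec{c}_{ij}), \quad \varvec{m} = \varvec{m}_\lambda(\varvec{\alpha}), \]
where \(\ell _t'(x)=-2[t - x]_+\) is the derivative of \(\ell _t\). When the regularization parameter \(\lambda \) is larger than a certain \(\lambda _{\max }\), the optimal solution is \({\varvec{m}} = {\varvec{0}}\). Then, the optimal dual variables are \(\alpha _{il}=-\ell _L'(0)=2L\) and \(\alpha _{ij}=-\ell _{-U}'(0)=0\). By substituting these equations into (6), we obtain \(\lambda _{\max }\) as
\[ \lambda_{\max} = \max_{k} \varvec{C}_{k,:} \varvec{\alpha}. \]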
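As a worked illustration of the last step, the sketch below computes \(\lambda _{\max }\) under the assumption that, at \({\varvec{m}} = {\varvec{0}}\), the dual variables are 2L for every different-class pair and 0 for every same-class pair, so that only the different-class columns of \({\varvec{C}}\) contribute. The function name and the toy vectors are hypothetical.

```python
def lambda_max(diff_pair_features, L):
    """Smallest lambda for which m = 0 is optimal (under the stated assumption).

    diff_pair_features: list of c_il vectors, one per (i, l) with l in D_i.
    Each row k of C times alpha reduces to 2 * L * sum of c_il[k].
    """
    p = len(diff_pair_features[0])
    return max(2.0 * L * sum(c[k] for c in diff_pair_features)
               for k in range(p))

# Two different-class pairs over three hypothetical subgraph features.
c_diff = [[0, 1, 1], [1, 1, 0]]
lam_max = lambda_max(c_diff, L=1.0)
```

In a pathwise procedure, this value is a natural starting point from which the regularization parameter is gradually decreased.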
Selection of \({{\mathcal {S}}}_i\) and \({{\mathcal {D}}}_i\)
For \(K = |{{\mathcal {S}}}_i| = |{{\mathcal {D}}}_i|\), in the experiments reported later, we employed the small number \(K = 10\) and used a graph kernel to select samples in \({{\mathcal {S}}}_i\) and \({{\mathcal {D}}}_i\). Although we simply used a predetermined kernel, selecting the kernel (or its parameter) through cross-validation beforehand is also possible. Using only a small number of neighbors is a common setting in metric learning. For example, Davis et al. (2007), which is a seminal work on the pairwise approach, only used \(20c^2\) pairs in total, where c is the number of classes. A small K setting has two aims. First, particularly for \({{\mathcal {S}}}_i\), adding pairs that are too far apart can be avoided under this setting. Even for a pair of samples with the same label, forcing such distant pairs to be close may cause overfitting (e.g., when the sample is an outlier). Second, a small K reduces the computational cost. Because the number of pairs is \(O(n^2)\), adding all of them into the loss term requires a large computational cost. In fact, these two issues arise not only for the pairwise formulation but also for other relative loss functions such as the standard triplet loss, for which there exist \(O(n^3)\) triplets. One potential difficulty in selecting \({{\mathcal {D}}}_i\) and \({{\mathcal {S}}}_i\) is the discrepancy between the initial and the optimal metric. The loss function is defined through \({{\mathcal {D}}}_i\) and \({{\mathcal {S}}}_i\), which are selected based on the neighbors in the initial metric, but the optimization of the metric may change the nearest neighbors of each sample. A possible remedy for this problem is to adaptively change \({{\mathcal {D}}}_i\) and \({{\mathcal {S}}}_i\) in accordance with the updated metric (Takeuchi and Sugiyama 2011), though the resulting optimality of this approach is still not known. 
To the best of our knowledge, this is still an open problem in metric learning, which we consider beyond the scope of this paper. In the experiments (Sect. 7), we show that a nearest-neighbor classifier in the learned metric with this heuristic selection of \({{\mathcal {D}}}_i\) and \({{\mathcal {S}}}_i\) shows better or comparable performance to standard graph classification methods, such as a graph neural network.
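The neighbor selection described in this subsection can be sketched as follows. The sketch assumes a precomputed pairwise distance matrix under the default metric (for example, one derived from a graph kernel); the function name, the matrix, and the labels are hypothetical and only illustrate the selection of the K nearest same-class and different-class samples.

```python
def select_neighbors(i, labels, dist, K):
    """Return (S_i, D_i): the K nearest same-class and different-class indices."""
    others = [j for j in range(len(labels)) if j != i]
    others.sort(key=lambda j: dist[i][j])  # nearest first under the default metric
    same = [j for j in others if labels[j] == labels[i]][:K]
    diff = [j for j in others if labels[j] != labels[i]][:K]
    return same, diff

# Toy data: four samples, two classes, a symmetric distance matrix.
labels = [0, 0, 1, 1]
dist = [[0, 2, 1, 3],
        [2, 0, 4, 1],
        [1, 4, 0, 2],
        [3, 1, 2, 0]]
S0, D0 = select_neighbors(0, labels, dist, K=1)
```

With K fixed, the total number of pairs grows linearly in n rather than quadratically, which is the computational motivation given above.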
Creating a tractable subproblem
Because the problems (4) and (5) are convex, any local solution is also a global optimum. However, naïvely solving these problems is computationally intractable because of the high dimensionality of \(\varvec{m}\). In this section, we introduce several useful rules for restricting candidate subgraphs while maintaining the optimality of the final solution. Note that the proofs for all the lemmas and theorems are provided in the appendix.
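To make the optimization problem tractable, we work with only a small subset of features during the optimization process. Let \({{\mathcal {F}}}\subseteq [p]\) be a subset of features. By fixing \(m_i = 0\) for \(i \notin {{\mathcal {F}}}\), we define subproblems of the original primal \(P_{\lambda }\) and dual \(D_{\lambda }\) problems as follows:
\[ \min_{\varvec{m}_{{\mathcal{F}}} \ge \varvec{0}} P_\lambda^{{\mathcal{F}}}(\varvec{m}_{{\mathcal{F}}}) := \sum_{i \in [n]} \Bigl[ \sum_{l \in {\mathcal{D}}_i} \ell_L(\varvec{m}_{{\mathcal{F}}}^\top \varvec{c}_{il{\mathcal{F}}}) + \sum_{j \in {\mathcal{S}}_i} \ell_{-U}(-\varvec{m}_{{\mathcal{F}}}^\top \varvec{c}_{ij{\mathcal{F}}}) \Bigr] + \lambda R(\varvec{m}_{{\mathcal{F}}}), \tag{9} \]
\[ \max_{\varvec{\alpha} \ge \varvec{0}} D_\lambda^{{\mathcal{F}}}(\varvec{\alpha}) := -\frac{1}{4}\Vert \varvec{\alpha} \Vert_2^2 + \varvec{t}^\top \varvec{\alpha} - \frac{\lambda \eta}{2} \Vert \varvec{m}_\lambda(\varvec{\alpha})_{{\mathcal{F}}} \Vert_2^2, \tag{10} \]
where \(\varvec{m}_{{{\mathcal {F}}}}\), \(\varvec{c}_{ij{{\mathcal {F}}}}\), and \(\varvec{m}_{\lambda } ({\varvec{\alpha }})_{{\mathcal {F}}}\) are subvectors specified by \({{\mathcal {F}}}\). If the size of \({{\mathcal {F}}}\) is moderate, these subproblems are computationally much easier to solve than the original problems.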
We introduce several criteria that determine whether the feature k should be included in \({{\mathcal {F}}}\) using the techniques of safe screening (Ghaoui et al. 2010) and working set selection (Fan et al. 2008). A general form of our criteria can be written as
\[ \varvec{C}_{k,:} \varvec{q} + r \Vert \varvec{C}_{k,:} \Vert_2 \le T, \tag{11} \]
where \({\varvec{q}}\in {\mathbb {R}}_+^{2nK}\), \(r\ge 0\), and \(T \in {\mathbb {R}}\) are constants that assume different values for each criterion. If this inequality holds for k, we exclude the kth feature from \({{\mathcal {F}}}\). An important property is that although our algorithm only solves these small subproblems, we can guarantee the optimality of the final solution, as shown later.
However, selecting \({{\mathcal {F}}}\) itself is computationally expensive because the evaluation of (11) requires O(n) computations for each k. Thus, we exploit a tree structure of graphs for determining \({{\mathcal {F}}}\). Figure 1 shows an example of such a tree, which can be constructed by a graph mining algorithm, such as gSpan (Yan and Han 2002). Suppose that the kth node corresponds to the kth dimension of \(\varvec{x}\) (note that the node index here is not the order of the visit). If the graph corresponding to the kth node is a subgraph of the graph corresponding to the \(k'\)th node, the node \(k'\) is a descendant of k, which is denoted as \(k' \supseteq k\). Then, the following monotonic relation is immediately derived from the monotonicity of \(\phi _H\):
\[ x_{i,k'} \le x_{i,k} \quad \text{for } k' \supseteq k. \tag{12} \]
Because any parent node is a subgraph of its children in the gSpan tree (Fig. 1), the non-overlapped frequency \(\#(H \sqsubseteq G)\) of subgraph H in G is monotonically nonincreasing while descending the tree. Then, the condition (12) is obviously satisfied because for a sequence of \(H \sqsubseteq H' \sqsubseteq H'' \sqsubseteq \cdots \) in the descending path of the tree, \(x_{i,k(H)} = \phi _H(G_i) = g(\#(H \sqsubseteq G_i))\) is monotonically nonincreasing, where \(x_{i,k(H)}\) is a feature corresponding to H in \(G_i\). Based on this property, the following lemma enables us to prune a node during the tree traversal.
Lemma 1
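Let
\[ \mathrm{Prune}(k \mid \varvec{q}, r) := \sum_{i \in [n]} \sum_{l \in {\mathcal{D}}_i} q_{il} \max \{x_{i,k}, x_{l,k}\}^2 + r \sqrt{ \sum_{i \in [n]} \Bigl[ \sum_{l \in {\mathcal{D}}_i} \max \{x_{i,k}, x_{l,k}\}^4 + \sum_{j \in {\mathcal{S}}_i} \max \{x_{i,k}, x_{j,k}\}^4 \Bigr] } \tag{13} \]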
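be a pruning criterion. Then, if the inequality
\[ \mathrm{Prune}(k \mid \varvec{q}, r) \le T \tag{14} \]
holds, for any descendant node \(k' \supseteq k\), the following inequality holds:
\[ \varvec{C}_{k',:} \varvec{q} + r \Vert \varvec{C}_{k',:} \Vert_2 \le T, \]
where \({\varvec{q}}\in {\mathbb {R}}_+^{2nK}\) and \(r\ge 0\) are an arbitrary constant vector and scalar variable, respectively.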
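The idea behind this lemma can be illustrated numerically. The sketch below is not taken from the released IGML code; it assumes one valid upper bound that follows from the monotonicity of the features: for nonnegative features, each squared difference at a descendant is at most the squared maximum of the two feature values at the ancestor, and the same-class contributions (which enter with a negative sign) are dropped from the first term. All function names, the pair encoding, and the toy values are hypothetical.

```python
import math

def row_of_C(xk, pairs):
    """Row of C for one feature: +c for different-class pairs, -c for same-class."""
    return [(1 if is_diff else -1) * (xk[i] - xk[j]) ** 2
            for i, j, is_diff in pairs]

def criterion(xk, pairs, q, r):
    """Left-hand side of the general rule: C_k q + r * ||C_k||_2."""
    row = row_of_C(xk, pairs)
    return (sum(c * w for c, w in zip(row, q))
            + r * math.sqrt(sum(c * c for c in row)))

def prune_bound(xk, pairs, q, r):
    """Upper bound on criterion(...) for node k and every descendant of k."""
    mx = [max(xk[i], xk[j]) for i, j, _ in pairs]
    first = sum(w * v ** 2
                for w, v, (_, _, is_diff) in zip(q, mx, pairs) if is_diff)
    return first + r * math.sqrt(sum(v ** 4 for v in mx))

# Three graphs; one different-class pair (0, 2) and one same-class pair (0, 1).
pairs = [(0, 2, True), (0, 1, False)]
q, r = [1.0, 0.5], 0.1
x_parent = [2, 1, 0]  # feature values of node k in the three graphs
x_child = [1, 1, 0]   # a descendant: element-wise no larger than the parent
bound = prune_bound(x_parent, pairs, q, r)
```

If the bound computed at a node is below the threshold T, the criterion is certified for the whole subtree without evaluating any descendant.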
This lemma indicates that if the condition (14) is satisfied, we can say that none of the descendant nodes are included in \({{\mathcal {F}}}\). Assuming that the indicator function \(g(x) = 1_{x>0}\) is used in (2), a tighter bound can be obtained through the following lemma.
Lemma 2
If \(g(x) = 1_{x>0}\) is set in (2), the pruning criterion (13) can be replaced with
By comparing the first terms of Lemmas 1 and 2, we see that Lemma 2 is tighter when \(g(x) = 1_{x>0}\) as follows:
A schematic illustration of the optimization algorithm for IGML is shown in Fig. 2 (for further details, see Sect. 5). To generate a subset of features \({\mathcal {F}}\), we first traverse the graph mining tree during which the safe screening/working set selection procedure and their pruning extensions are performed (Step1). Next, we solve the subproblem (9) with the generated \({\mathcal {F}}\) using a standard gradient-based algorithm (Step2). Safe screening is also performed during the optimization iteration in Step2, which is referred to as dynamic screening. This further reduces the size of \({\mathcal {F}}\).
Before moving on to detailed formulations, we summarize our rules to determine \({{\mathcal {F}}}\) in Table 1. The columns represent the different approaches to evaluating the necessity of features, i.e., safe and working set approaches. For the safe approaches, there are further ‘single \(\lambda \)’ (described in Sect. 4.1.2) and ‘range of \(\lambda \)’ (described in Sect. 4.1.3) approaches. The single \(\lambda \) approach considers safe rules for a specific \(\lambda \), while the range of \(\lambda \) approach considers safe rules that can eliminate features for a range of \(\lambda \) (not just a specific value). Both the single and range approaches are based on the bounds of the region in which the optimal solution exists, for which details are given in Sect. 4.1.1. The rows of Table 1 distinguish rules that remove one specific feature from rules that prune all features in a subtree.
Safe screening
Safe screening (Ghaoui et al. 2010) was first proposed to identify unnecessary features in LASSO-type problems. Typically, this approach considers a bounded region of dual variables in which the optimal solution must exist. Then, we can eliminate dual inequality constraints that are never violated given that the solution exists in that region. The well-known Karush–Kuhn–Tucker (KKT) conditions show that this is equivalent to the elimination of primal variables that take value 0 at the optimal solution. In Sect. 4.1.1, we first derive a spherical bound for our optimal solution, and then in Sect. 4.1.2, a rule for safe screening is shown. Section 4.1.3 extends rules that are specifically useful for the regularization path calculation.
Sphere bound for optimal solution
The following theorem provides a hypersphere containing the optimal dual variable \(\varvec{\alpha }^\star \).
Theorem 1
(DGB) For any pair of \({\varvec{m}}\ge {\varvec{0}}\) and \({\varvec{\alpha }}\ge {\varvec{0}}\), the optimal dual variable \({\varvec{\alpha }}^\star \) must satisfy
\[ \Vert \varvec{\alpha} - \varvec{\alpha}^\star \Vert_2^2 \le 4 \left( P_\lambda(\varvec{m}) - D_\lambda(\varvec{\alpha}) \right). \]
This bound is called the duality gap bound (DGB), and the parameters \(\varvec{m}\) and \(\varvec{\alpha }\) used to construct the bound are referred to as the reference solution. This inequality reveals that the optimal \(\varvec{\alpha }^{\star }\) lies inside the sphere whose center is the reference solution \(\varvec{\alpha }\) and whose radius is \(2 \sqrt{P_\lambda ({\varvec{m}}) - D_\lambda ({\varvec{\alpha }})}\), i.e., twice the square root of the duality gap. Therefore, the better the quality of the reference solution \(\varvec{m}\) and \(\varvec{\alpha }\), the tighter the bound. When the duality gap is zero, meaning that \(\varvec{m}\) and \(\varvec{\alpha }\) are optimal, the radius shrinks to zero.
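The radius of the DGB can be computed directly from a pair of reference solutions. The following sketch assumes the primal objective is the sum of squared hinge losses \([t_a - (\varvec{C}^\top \varvec{m})_a]_+^2\) plus the elastic-net penalty, and the dual objective is \(-\frac{1}{4}\Vert \varvec{\alpha }\Vert _2^2 + \varvec{t}^\top \varvec{\alpha } - \frac{\lambda \eta }{2}\Vert \varvec{m}_\lambda (\varvec{\alpha })\Vert _2^2\) with signed columns in \(\varvec{C}\) (negated for same-class pairs) and \(\eta > 0\). The function names and the one-feature toy problem are hypothetical.

```python
import math

def primal(m, C, t, lam, eta):
    """Primal objective for weights m >= 0; C is a p x (number of pairs) list."""
    z = [sum(C[k][a] * m[k] for k in range(len(m))) for a in range(len(t))]
    loss = sum(max(ta - za, 0.0) ** 2 for ta, za in zip(t, z))
    return loss + lam * (sum(m) + 0.5 * eta * sum(v * v for v in m))

def m_of_alpha(alpha, C, lam, eta):
    """Primal weights induced by dual variables: [C alpha - lam]_+ / (lam * eta)."""
    return [max(sum(c * a for c, a in zip(row, alpha)) - lam, 0.0) / (lam * eta)
            for row in C]

def dual(alpha, C, t, lam, eta):
    """Dual objective for alpha >= 0 (requires eta > 0)."""
    m = m_of_alpha(alpha, C, lam, eta)
    return (-0.25 * sum(a * a for a in alpha)
            + sum(ta * a for ta, a in zip(t, alpha))
            - 0.5 * lam * eta * sum(v * v for v in m))

def dgb_radius(m, alpha, C, t, lam, eta):
    """Radius of the duality gap bound: 2 * sqrt(P(m) - D(alpha))."""
    gap = primal(m, C, t, lam, eta) - dual(alpha, C, t, lam, eta)
    return 2.0 * math.sqrt(max(gap, 0.0))

# One feature, one different-class pair (+1) and one same-class pair (-1).
C = [[1.0, -1.0]]
t = [1.0, -0.5]            # [L, -U] with L = 1.0 and U = 0.5
m_ref, alpha_ref = [0.0], [2.0, 0.0]
radius = dgb_radius(m_ref, alpha_ref, C, t, lam=0.5, eta=1.0)
```

For this toy problem the optimum can be found by hand (the optimal weight is 5/9), and the corresponding optimal dual vector indeed lies within the computed radius of the reference point.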
If the optimal solution for \(\lambda _0\) is available as a reference solution to construct the bound for \(\lambda _1\), the following bound, called the regularization path bound (RPB), can be obtained.
Theorem 2
(RPB) Let \({\varvec{\alpha }}_0^\star \) be the optimal solution for \(\lambda _0\) and \({\varvec{\alpha }}_1^\star \) be the optimal solution for \(\lambda _1\). Then,
\[ \left\Vert \varvec{\alpha}_1^\star - \frac{\lambda_0 + \lambda_1}{2\lambda_0} \varvec{\alpha}_0^\star \right\Vert_2 \le \left\Vert \frac{\lambda_0 - \lambda_1}{2\lambda_0} \varvec{\alpha}_0^\star \right\Vert_2. \]
This inequality indicates that the optimal dual solution for \(\lambda _1\) (\(\varvec{\alpha }_1^{\star }\)) lies in the sphere whose center is \(\frac{\lambda _0+\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \) and whose radius is \(\left\Vert \frac{\lambda _0 - \lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \right\Vert _2\). However, RPB requires the exact solution, which is difficult to obtain in practice due to numerical errors. The relaxed RPB (RRPB) extends RPB to incorporate the approximate solution as a reference solution.
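The center and radius of the RPB sphere follow directly from the description above and can be computed as in the sketch below. The function name and the toy dual vector are hypothetical, and the reference vector is assumed to be the exact optimum for \(\lambda _0\).

```python
import math

def rpb_sphere(alpha0_star, lam0, lam1):
    """Center and radius of the regularization path bound for lambda_1."""
    scale = (lam0 + lam1) / (2.0 * lam0)
    center = [scale * a for a in alpha0_star]
    norm = math.sqrt(sum(a * a for a in alpha0_star))
    radius = abs(lam0 - lam1) / (2.0 * lam0) * norm
    return center, radius

# Toy optimal dual vector for lambda_0 = 1.0; bound for lambda_1 = 0.5.
center, radius_rpb = rpb_sphere([3.0, 4.0], lam0=1.0, lam1=0.5)
```

The radius is proportional to the gap between the two regularization parameters, so a finer grid on the regularization path yields tighter spheres.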
Theorem 3
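(RRPB) Assuming that \({\varvec{\alpha }}_0\) satisfies \(\Vert {\varvec{\alpha }}_0 - {\varvec{\alpha }}_0^\star \Vert _2\le \epsilon \), the optimal solution \({\varvec{\alpha }}_1^\star \) for \(\lambda _1\) must satisfy
\[ \left\Vert \varvec{\alpha}_1^\star - \frac{\lambda_0 + \lambda_1}{2\lambda_0} \varvec{\alpha}_0 \right\Vert_2 \le \left\Vert \frac{\lambda_0 - \lambda_1}{2\lambda_0} \varvec{\alpha}_0 \right\Vert_2 + \left( \frac{\lambda_0 + \lambda_1}{2\lambda_0} + \frac{|\lambda_0 - \lambda_1|}{2\lambda_0} \right) \epsilon. \]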
In Theorem 3, the reference \(\varvec{\alpha }_0\) is only assumed to be close to \(\varvec{\alpha }_0^\star \) within the radius \(\epsilon \) instead of assuming that \(\varvec{\alpha }_0^\star \) is available. For example, \(\epsilon \) can be obtained using the DGB (Theorem 1).
Similar bounds to those derived here were previously considered for the triplet screening of metric learning on usual numerical data (Yoshida et al. 2018, 2019b). Here, we extend a similar idea to derive subgraph screening.
Safe screening and safe pruning rules
Theorems 1 and 3 identify the regions where the optimal solution exists using a current feasible solution \({\varvec{\alpha }}\). Further, from (6), when \({\varvec{C}}_{k,:}{\varvec{\alpha }}^\star \le \lambda \), we have \(m_k^\star =0\). This indicates that
where \({{\mathcal {B}}}\) is a region containing the optimal solution \({\varvec{\alpha }}^\star \), i.e., \({\varvec{\alpha }}^\star \in {\mathcal {B}}\). As we derived in Sect. 4.1.1, the sphere-shaped \({{\mathcal {B}}}\) can be constructed using feasible primal and dual solutions. By solving this maximization problem, we obtain the following safe screening (SS) rule.
Theorem 4
(SS Rule) If the optimal solution \({\varvec{\alpha }}^\star \) exists in the bound \({\mathcal {B}}=\{{\varvec{\alpha }} \mid \Vert {\varvec{\alpha }} - {\varvec{q}} \Vert _2^2\le r^2\}\), the following rule holds
Theorem 4 indicates that we can eliminate unnecessary features by evaluating the condition shown in (16). Here, the theorem is written in a general form, and in practice, \(\varvec{q}\) and r can be defined by the center and radius of one of the sphere bounds, respectively. An important property of this rule is that it guarantees optimality, meaning that the subproblems (9) and (10) have exactly the same optimal solution as the original problem if \({{\mathcal {F}}}\) is defined through this rule. However, it is still necessary to evaluate the rule for all p features, which is computationally intractable. To avoid this problem, we derive a pruning strategy on the graph mining tree, which we call the safe pruning (SP) rule.
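To illustrate how a sphere bound turns into a screening test, the sketch below uses the standard closed form for the maximum of a linear function over a Euclidean ball, \(\max _{\Vert \varvec{\alpha }-\varvec{q}\Vert _2\le r} \varvec{c}^\top \varvec{\alpha } = \varvec{c}^\top \varvec{q} + r\Vert \varvec{c}\Vert _2\). This is an assumption about the shape of the test rather than a reproduction of rule (16): any further constraints on \(\varvec{\alpha }\) are ignored, which keeps the test safe but possibly looser. The names `sphere_upper_bound` and `can_screen` are hypothetical.

```python
import math

def sphere_upper_bound(c_k, q, r):
    """Upper bound of c_k . alpha over the ball ||alpha - q||_2 <= r."""
    dot = sum(c * x for c, x in zip(c_k, q))
    norm = math.sqrt(sum(c * c for c in c_k))
    return dot + r * norm

def can_screen(c_k, q, r, lam):
    """True if C_{k,:} alpha <= lam holds for every alpha in the ball,
    so that m_k = 0 at the optimum and feature k can be dropped."""
    return sphere_upper_bound(c_k, q, r) <= lam

# Toy row of C and toy sphere (illustrative values only).
row, center, radius = [3.0, 4.0], [1.0, 0.0], 0.5
print(can_screen(row, center, radius, lam=6.0))
print(can_screen(row, center, radius, lam=5.0))
```

A smaller radius makes the bound tighter, which is why a better reference solution lets the rule remove more features.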
Theorem 5
(SP Rule) If the optimal solution \({\varvec{\alpha }}^\star \) is in the bound \({\mathcal {B}}=\{{\varvec{\alpha }} \mid \Vert {\varvec{\alpha }} - {\varvec{q}} \Vert _2^2\le r^2, {\varvec{q}}\ge \varvec{0}\}\), the following rule holds
This theorem is a direct consequence of Lemma 1. If this condition holds for a node k during the tree traversal, a subtree below that node can be pruned. This means that we can safely eliminate unnecessary subgraphs even without enumerating them. In this theorem, note that \({{\mathcal {B}}}\) has an additional nonnegative constraint \(\varvec{q} \ge \varvec{0}\), but this is satisfied by all the bounds in Sect. 4.1.1 because of the nonnegative constraint in the dual problem.
Range-based safe screening and safe pruning
The SS and SP rules apply to a fixed \(\lambda \). The range-based extension identifies an interval of \(\lambda \) for which the satisfaction of SS/SP is guaranteed. This is particularly useful for the pathwise optimization or regularization path calculation, where the problem must be solved with a sequence of \(\lambda \). We assume that the sequence is sorted in descending order, as optimization algorithms typically start from the trivial solution \(\varvec{m} = \varvec{0}\). Let \(\lambda =\lambda _1\le \lambda _0\). By combining RRPB with the rule (16), we obtain the following theorem.
Theorem 6
(Range-based Safe Screening (RSS)) For any k, the following rule holds
where
This rule indicates that we can safely ignore \(m_k\) for \(\lambda \in [\lambda _a, \lambda _0]\), while if \(\lambda _a > \lambda _0\), the weight \(m_k\) cannot be removed by this rule. For SP, the range-based rule can also be derived from (17).
Theorem 7
(Range-based Safe Pruning (RSP)) For any \(k' \supseteq k\), the following pruning rule holds:
where
This theorem indicates that, while \(\lambda \in [\lambda _a',\lambda _0]\), we can safely remove the entire subtree of k. Analogously, if the feature vector is generated from \(g(x) = 1_{x>0}\) (i.e., binary), the following theorem holds.
Theorem 8
(Range-based Safe Pruning (RSP) for binary feature) Assuming \(g(x) = 1_{x>0}\) in (2), a and b in Theorem 7 can be replaced with
Because these constants a and b are derived from the tighter bound in Lemma 2, the obtained range becomes wider than the range in Theorem 7.
Once we calculate \(\lambda _a\) and \(\lambda '_a\) of (18) and (19) for some \(\lambda \), they are stored at each node of the tree. Subsequently, such \(\lambda _a\) and \(\lambda '_a\) can be used for the next tree traversal with different \(\lambda '\). If the conditions of (18) or (19) are satisfied, the node can be skipped (RSS) or pruned (RSP). Otherwise, we update \(\lambda _a\) and \(\lambda '_a\) by using the current reference solution.
Working set method
Safe rules are strong rules in the sense that they can completely remove features; thus, they are sometimes too conservative to fully accelerate the optimization. In contrast, the working set selection is a widely accepted heuristic approach to selecting a subset of features.
Working set selection and working set pruning
The working set (WS) method optimizes the problem with respect to only selected working set features. Then, if the optimality condition for the original problem is not satisfied, the working set is reselected and the optimization on the new working set restarts. This process iterates until optimality on the original problem is achieved.
Besides the safe rules, we use the following WS selection criterion, which is obtained directly from the KKT conditions:
If this inequality is satisfied, the kth dimension is predicted as \(m_k^\star =0\). Hence, the working set is defined by
Although \(m^\star _i = 0\) for \(i \notin {{\mathcal {W}}}\) is not guaranteed, the final convergence of the procedure is guaranteed by the following theorem.
Theorem 9
(Convergence of WS) Assume that there is a solver for the subproblem (9) (or equivalently (10)) that returns the optimal solution for given \({{\mathcal {F}}}\). The working set method, which alternates between optimizing the subproblem with \({{\mathcal {F}}}= {{\mathcal {W}}}\) and updating \({{\mathcal {W}}}\), returns the optimal solution of the original problem in a finite number of steps.
However, here again, the inequality (20) needs to be evaluated for all features, which is computationally intractable.
The same pruning strategy as for SS/SP can be incorporated into working set selection. The criterion (20) is also a special case of (11), and Lemma 1 indicates that if the following inequality
holds, then no \(k' \supseteq k\) is included in the working set. We refer to this criterion as working set pruning (WP).
Relation with safe rules
Note that for the working set method, we may need to update \({{\mathcal {W}}}\) multiple times, unlike in the safe screening approaches, as shown by Theorem 9. In return, the working set method can usually exclude a larger number of features than the safe screening approaches. In fact, when the condition of the SS rule (16) is satisfied, the WS criterion (20) must likewise be satisfied. Because all the spheres (DGB, RPB and RRPB) contain the reference solution \(\varvec{\alpha }\), which is usually the current solution, the inequality
holds, where \({{\mathcal {B}}}\) is a sphere created by DGB, RPB or RRPB. This indicates that when the SS rule excludes the kth feature, the WS also excludes the kth feature. However, to guarantee convergence, WS needs to be fixed until the subproblem (9)–(10) is solved (Theorem 9). In contrast, the SS rule is applicable anytime during the optimization procedure without affecting the final optimality. This enables us to apply the SS rule even to the subproblem (9)–(10), where \({{\mathcal {F}}}\) is defined by WS as shown in Step 2 of Fig. 2 (dynamic screening).
For the pruning rules, we first confirm the following two properties:
where \(\varvec{q} \in {\mathbb {R}}_+^{2 n K}\) is the center of the sphere, \(r \ge 0\) is the radius, and \(C \in {\mathbb {R}}\) is a constant. In the case of DGB, the center of the sphere is the reference solution \(\varvec{\alpha }\) itself, i.e., \(\varvec{q} = \varvec{\alpha }\). Then, the following relation holds between the SP criterion \(\mathrm{Prune}(k \mid \varvec{q},r)\) and WP criterion \(\mathrm{Prune}_{\mathrm{WP}}(k)\):
This once more indicates that when the SP rule is satisfied, the WP rule must be satisfied as well. When the RPB or RRPB sphere is used, the center of the sphere is \(\varvec{q} = \frac{\lambda _0 + \lambda _1}{2 \lambda _0} \varvec{\alpha }_0\). Assuming that the solution for \(\lambda _0\) is used as the reference solution, i.e., \(\varvec{\alpha }= \varvec{\alpha }_0\), we obtain
Using this inequality, we obtain
From this inequality, if \(\lambda _1 > \lambda _0\), then \(\mathrm{Prune}(k \mid \varvec{q},r) > \mathrm{Prune}_{\mathrm{WP}}(k)\) (note that \(\mathrm{Prune}_{\mathrm{WP}}(k) \ge 0\) because \(\varvec{\alpha }\ge \varvec{0}\)), indicating that the pruning of WS is always tighter than that of the safe rule. However, in our algorithm presented in Sect. 5, \(\lambda _1 < \lambda _0\) holds because we start from a larger value of \(\lambda \) and gradually decrease it. Then, this inequality does not hold, and \(\mathrm{Prune}(k \mid \varvec{q},r) < \mathrm{Prune}_{\mathrm{WP}}(k)\) becomes a possibility.
When the WS and WP rules are strictly tighter than the SS and SP rules, respectively, using both the WS/WP and SS/SP rules is equivalent to using WS/WP only (except for dynamic screening). Even in this case, the range-based safe approaches (the RSS and RSP rules) can still be effective. When the range-based rules are evaluated, we obtain the range of \(\lambda \) such that the SS or SP rule is satisfied. Thus, as long as \(\lambda \) is in that range, we do not need to evaluate any safe or working set rules.
Algorithm and computations
Training with pathwise optimization
We employ pathwise optimization (Friedman et al. 2007), where the optimization starts from \(\lambda = \lambda _{\max }\) and \(\lambda \) is gradually decreased while \(\varvec{m}\) is optimized. As can be seen from (8), \(\lambda _{\max }\) is defined by the maximum of the inner product \(\varvec{C}_{k,:}\varvec{\alpha }\). This value can also be found by a tree search with pruning. Suppose that we calculate \(\varvec{C}_{k,:}\varvec{\alpha }\) while traversing the tree and \({\hat{\lambda }}_{\max }\) is the current maximum value during the traversal. Using Lemma 1, we can derive the pruning rule
If this condition holds, the descendant nodes of k cannot be maximal, and thus we can identify \(\lambda _{\max }\) without calculating \(\varvec{C}_{k,:}\varvec{\alpha }\) for all candidate features.
Algorithm 1 shows the outer loop of our pathwise optimization. The TRAVERSE and SOLVE functions in Algorithm 1 are shown in Algorithms 2 and 3, respectively. Algorithm 1 first calculates \(\lambda _{\max }\), which is the minimum \(\lambda \) at which the optimal solution is \(\varvec{m}^{\star } = \varvec{0}\) (line 3). The outer loop in lines 5-14 is the process of decreasing \(\lambda \) with the decreasing rate R. The TRAVERSE function in line 7 determines the subset of features \({{\mathcal {F}}}\) by traversing the tree with SS and WS. The inner loop (lines 9-13) alternately solves the optimization problem with the current \({{\mathcal {F}}}\) and updates \({{\mathcal {F}}}\) until the duality gap becomes less than the given threshold \(\mathrm{eps}\).
Algorithm 2 shows the TRAVERSE function, which recursively visits tree nodes to determine \({{\mathcal {F}}}\). The variable node.pruning contains \(\lambda '_a\) of RSP, and if the RSP condition (19) is satisfied (line 3), the function returns the current \({{\mathcal {F}}}\) (the node is pruned). The variable node.screening contains \(\lambda _a\) of RSS, and if the RSS condition (18) is satisfied (line 5), this node can be skipped, and the function proceeds to the next node. If these two conditions are not satisfied, the function 1) updates node.pruning and node.screening if update is true, and 2) evaluates the conditions of RSP and WP (line 10), and RSS and WS (line 14). In lines 17-18, gSpan expands the children of the current node, and for each child node, the TRAVERSE function is called recursively.
Algorithm 3 shows a solver for the primal problem with the subset of features \({{\mathcal {F}}}\). Although we employ a simple projected gradient algorithm, any optimization algorithm can be used in this process. In lines 7-10, the SS rule is evaluated after every \(\mathrm{freq}\) iterations. Note that this SS is only for the subproblems (9) and (10) created by the current \({{\mathcal {F}}}\) (not for the original problems).
Enumerating subgraphs for test data
To obtain a feature vector for test data, we only need to enumerate subgraphs with \(m_k\ne 0\). When gSpan is used as a mining algorithm, a unique code, called the minimum DFS code, is assigned to each node. If a DFS code for a node is \((a_1, a_2, \ldots , a_n)\), a child node is represented by \((a_1, a_2, \ldots , a_n, a_{n+1})\). This enables us to prune nodes that do not generate subgraphs with \(m_k\ne 0\). Suppose that a subgraph \((a_1, a_2, a_3) = (x, y, z)\) must be enumerated, and that we are currently at node \((a_1) = (x)\). Then, a child with \((a_1, a_2) = (x, y)\) should be traversed, but a child with \((a_1, a_2) = (x, w)\) cannot generate (x, y, z), and consequently we can stop the traversal of this node.
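The prefix argument above can be sketched as a traversal that only descends into children whose code is a prefix of some selected code. This is a simplified illustration: codes are modeled as plain tuples of symbols (real gSpan DFS codes are sequences of edge tuples), and `children_of` is a hypothetical callback standing in for the pattern-growth step of the mining algorithm.

```python
def needed_prefixes(selected_codes):
    """Collect every non-empty prefix of every selected code."""
    prefixes = set()
    for code in selected_codes:
        for length in range(1, len(code) + 1):
            prefixes.add(tuple(code[:length]))
    return prefixes

def traverse(code, children_of, prefixes, selected, found):
    """Depth-first traversal that skips children unable to reach a selected code."""
    if code in selected:
        found.append(code)
    for child in children_of(code):
        if child in prefixes:  # otherwise no descendant can have m_k != 0
            traverse(child, children_of, prefixes, selected, found)
    return found

# Toy tree: every node can be extended by w, x, y or z up to length 3.
visited = []

def toy_children(code):
    visited.append(code)
    if len(code) >= 3:
        return []
    return [code + (s,) for s in "wxyz"]

selected = {("x", "y", "z")}
found = traverse((), toy_children, needed_prefixes(selected), selected, [])
print(found)    # the single selected code
print(visited)  # only the nodes on the path to it were expanded
```

In this toy run the node (x, w) is never expanded, mirroring the example in the text.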
Postprocessing
Learning Mahalanobis distance for selected subgraphs
Instead of \(\varvec{m}\), the following Mahalanobis distance can be considered
where \(\varvec{M}\) is a positive definite matrix. Directly optimizing \(\varvec{M}\) requires \(O(p^2)\) primal variables and a semidefinite constraint, making the problem computationally expensive, even for relatively small p. Thus, as optional postprocessing, we consider optimizing the Mahalanobis distance (22) for a small number of subgraphs selected by the optimized \(\varvec{m}\). Let \({{\mathcal {H}}}\subseteq {{\mathcal {G}}}\) be the set of subgraphs with \(m_{i(H)} > 0\) for \(H \in {{\mathcal {H}}}\), and let \(\varvec{z}_i\) be an \(h :=|{{\mathcal {H}}}|\)-dimensional feature vector consisting of \(\phi _H(G_i)\) for \(H \in {{\mathcal {H}}}\). For \(\varvec{M} \in {\mathbb {R}}^{h \times h}\), we consider the following metric learning problem:
Above, \(R: {\mathbb {R}}^{h \times h} \rightarrow {\mathbb {R}}\) is a regularization term for \(\varvec{M}\), where a typical setting is \(R(\varvec{M}) = \mathrm{tr} \varvec{M} + \frac{\eta }{2} \Vert \varvec{M} \Vert _F^2\) with \(\mathrm{tr}\) representing the trace of a matrix. This metric can be more discriminative, because it is optimized to the training data with a higher degree of freedom.
Vector representation of a graph
An explicit vector representation of an input graph can be obtained using optimized \({\varvec{m}}\) as follows:
Unlike the original \(\varvec{x}_i\), the new representation \(\varvec{x}_i'\) is computationally tractable because of the sparsity of \(\varvec{m}\), and simultaneously, this space should be highly discriminative. This property is beneficial for further analysis of the graph data. We show an example of applying the decision tree to the learned space later in the paper.
In the case of the general Mahalanobis distance given in Sect. 5.3.1, we can obtain further transformation. Let \(\varvec{M} = \varvec{V} \varvec{\Lambda }\varvec{V}^\top \) be the eigenvalue decomposition of the learned \(\varvec{M}\). By employing the regularization term \(R(\varvec{M}) = \mathrm{tr} \varvec{M} + \frac{\eta }{2} \Vert \varvec{M} \Vert _F^2\), some of the eigenvalues of \(\varvec{M}\) can be shrunk to 0 because \(\mathrm{tr} \varvec{M}\) is equal to the sum of the eigenvalues. If \(\varvec{M}\) has \(h' < h\) nonzero eigenvalues, \(\varvec{\Lambda }\) can be written as a \(h' \times h'\) diagonal matrix, and \(\varvec{V}\) is a \(h \times h'\) matrix such that each column is the eigenvector of a nonzero eigenvalue. Then, a representation of the graph is
This can be considered as a supervised dimensionality reduction from an h-dimensional to an \(h'\)-dimensional space. Although each dimension no longer corresponds to a subgraph in this representation, the interpretation remains clear because each dimension of the transformed vector is simply a linear combination of \(\varvec{z}_i\).
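As a rough numerical illustration of this dimensionality reduction, the sketch below assumes the representation takes the form \(\sqrt{\varvec{\Lambda }}\,\varvec{V}^\top \varvec{z}_i\), which is the factorization under which squared Euclidean distances in the reduced space equal the Mahalanobis distances induced by \(\varvec{M}\). The function name `low_rank_embedding`, the tolerance `tol`, and the toy matrices are illustrative assumptions.

```python
import numpy as np

def low_rank_embedding(M, Z, tol=1e-10):
    """Map rows of Z (n x h) to the h'-dimensional space spanned by the
    eigenvectors of the symmetric PSD matrix M with non-zero eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(M)             # M = V diag(eigvals) V^T
    keep = eigvals > tol                             # drop (numerically) zero eigenvalues
    A = eigvecs[:, keep] * np.sqrt(eigvals[keep])    # h x h', columns scaled by sqrt(lambda)
    return Z @ A                                     # row i equals (sqrt(Lambda) V^T z_i)^T

# Toy example: M has rank 2, so 3-dimensional features are reduced to 2 dimensions.
M = np.diag([2.0, 1.0, 0.0])
Z = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 5.0]])
E = low_rank_embedding(M, Z)
d = Z[0] - Z[1]
print(E.shape, d @ M @ d, np.sum((E[0] - E[1]) ** 2))
```

The last line prints the Mahalanobis distance and the squared Euclidean distance in the reduced space, which agree up to floating-point error.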
Extensions
In this section, we consider three extensions of IGML: applications to other data types, employing a triplet loss function, and introducing vertex-label similarity.
Application to other structured data
In addition to graph data, the proposed method can be applied to itemset/sequence data. For an itemset, the Jaccard index, defined as the size of the intersection of two sets divided by the size of the union, is the most popular similarity measure. Although a few studies have considered kernels for an itemset (Zhang et al. 2007), to the best of our knowledge, it remains difficult to adapt a metric on a given labeled dataset in an interpretable manner. In contrast, there are many kernel approaches for sequence data. The spectrum kernel (Leslie et al. 2001) creates a kernel matrix by enumerating all k-length subsequences in the given sequence. The mismatch kernel (Leslie et al. 2004) enumerates subsequences allowing m discrepancies in a pattern of length k. The gappy kernel (Leslie and Kuang 2004; Kuksa et al. 2008) counts the number of k-mers (subsequences) with a certain number of gaps g that appear in the sequence. The above kernels require the value of the hyperparameter k, although various lengths may in fact be related. The motif kernel (Zhang and Zaki 2006; Pissis et al. 2013; Pissis 2014) counts the number of “motifs” appearing in the input sequences, where the “motif” must be decided by the user. Because these approaches are based on the idea of the ‘kernel’, they are unsupervised, unlike our approach.
By employing a similar approach to the graph input, we can construct a feature representation \(\phi _H(X_i)\) for both itemset and sequence data. For the itemset data, the ith input is a set of items \(X_i \subseteq {{\mathcal {I}}}\), where \({{\mathcal {I}}}\) is a set of all items, e.g., \(X_1 = \{ a, b \}, X_2 = \{ b, c, e \}, \ldots \) with the candidate items \({{\mathcal {I}}}= \{a, b, c, d, e\}\). The feature \(\phi _H(X_i)\) is defined by \(1_{H \subseteq X_i}\) for all \(H \subseteq {{\mathcal {I}}}\). This feature \(\phi _H(X_i)\) also has monotonicity \(\phi _{H^\prime }(X_i) \le \phi _{H}(X_i)\) for \(H^\prime \supseteq H\). In sequence data, the ith input \(X_i\) is a sequence of items. Thus, the feature \(\phi _H(X_i)\) is defined from the frequency of a subsequence H in the given \(X_i\). For example, if we have \(X_i = \langle b, b, a, b, a, c, d \rangle \) and \(H = \langle b, a \rangle \), then H occurs twice in \(X_i\). For sequence data, the monotonicity property is again guaranteed because \(\phi _{H^\prime }(X_i) \le \phi _{H}(X_i)\), where H is a subsequence of \(H^\prime \). Because of these monotonicity properties, we can apply the same pruning procedures to both itemset and sequence data. Figure 3 shows examples of trees that can be constructed by itemset and sequence mining algorithms (Agrawal et al. 1994; Pei et al. 2001).
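The two feature definitions above can be illustrated with a few lines of code. The itemset feature follows the indicator \(1_{H \subseteq X_i}\) directly. For sequences, the sketch counts contiguous occurrences, which is one counting convention that reproduces the "occurs twice" example; this choice is an assumption, and the function names are hypothetical.

```python
def itemset_feature(H, X):
    """Indicator feature 1_{H subset of X} for itemset data."""
    return 1 if set(H) <= set(X) else 0

def contiguous_count(H, X):
    """Number of positions at which H appears as a contiguous run in X
    (an assumed counting convention for the sequence feature)."""
    H, X = list(H), list(X)
    m = len(H)
    return sum(1 for s in range(len(X) - m + 1) if X[s:s + m] == H)

X2 = {"b", "c", "e"}
print(itemset_feature({"b"}, X2), itemset_feature({"b", "d"}, X2))

Xi = ["b", "b", "a", "b", "a", "c", "d"]
print(contiguous_count(["b", "a"], Xi))        # the pattern <b, a> from the text
print(contiguous_count(["b", "a", "b"], Xi))   # a longer pattern never occurs more often
```

Both features can only decrease when the pattern grows, which is the monotonicity that the pruning rules rely on.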
Triplet loss
We formulate the loss function of IGML as the pairwise loss (3). Triplet loss functions are also widely used in metric learning (e.g., Weinberger and Saul 2009):
where \({{\mathcal {T}}}\) is an index set of triplets consisting of (i, j, l) satisfying \(y_i=y_j, y_i\ne y_l\). This loss incurs a penalty when the distance between samples in the same class is larger than the distance between samples in different classes. Because the loss is defined by a ‘triplet’ of samples, this approach can be more time-consuming than the pairwise approach. In contrast, the relative evaluation such as \(d_{{\varvec{m}}}({\varvec{x}}_i,{\varvec{x}}_j) < d_{{\varvec{m}}}({\varvec{x}}_i,{\varvec{x}}_l)\) (the jth sample must be closer to the ith sample than the lth sample) can capture the higher-order relations between input objects rather than penalizing the pairwise distance.
A pruning rule can be derived even for the case of triplet loss. By defining \({\varvec{c}}_{ijl}:={\varvec{c}}_{il}-{\varvec{c}}_{ij}\), the loss function can be written as
Because this has the same form as pairwise loss with \(L=1\) and \(U=0\), the optimization problem is reduced to the same form as the pairwise case. We require a slight modification of Lemma 1 because of the change of the constant coefficients (i.e., from \({\varvec{c}}_{ij}\) to \({\varvec{c}}_{ijl}\)). The equation (13) is changed to
This is easily proven using
which is an immediate consequence of the monotonicity inequality (12).
Considering vertex-label similarity
Because IGML is based on the exact matching of subgraphs to create the feature \(\phi _H(G)\), it is difficult to provide a prediction for a graph that does not exactly match many of the selected subgraphs. Typically, this happens when the test dataset has a different distribution of vertex-labels. For example, in the case of the prediction on a chemical compound group whose atomic compositions are largely different from those in the training dataset, the exact match may not be expected as in the case of the training dataset. Therefore, we consider incorporating similarity/dissimilarity information of graph vertex-labels to relax this exact matching constraint. A toy example of vertex-label dissimilarity is shown in Fig. 4. In this case, the ‘red’ vertex is similar to the ‘green’ vertex, while it is dissimilar to the ‘yellow’ vertex. For example, we can create this type of table by using prior domain knowledge (e.g., chemical properties of atoms). Even when no prior information is available, a similarity matrix can be inferred using any embedding method (e.g., Huang et al. 2017).
Because it is difficult to directly incorporate similarity information into our subgraph isomorphism-based feature \(\phi _H(G)\), we first introduce a relaxed evaluation of inclusion of a graph P in a given graph G. We assume that P is obtained from the gSpan tree of the training data. Our approach is based on the idea of ‘relabeling’ graph vertex-labels in the Weisfeiler-Lehman (WL) kernel (Shervashidze et al. 2011), which is a well-known graph kernel with an approximate graph isomorphism test. Figure 5a shows an example of the relabeling procedure, which is performed in a fixed number of recursive steps. The number of steps is denoted as T (\(T = 3\) in the figure) and is assumed to be pre-specified. In step h, each graph vertex v has a level h hierarchical label \(L_G(v,h) :=(F^{(h)}, S^{(h)} = [S^{(h)}_1, \ldots , S^{(h)}_n])\), where \(F^{(h)}\) is recursively defined by the level \(h - 1\) hierarchical label of the same vertex, i.e., \(F^{(h)} = L_G(v,h-1)\), and \(S^{(h)}\) is a multiset created by the level \(h - 1\) hierarchical labels \(L_G(v',h-1)\) from all neighboring vertices \(v'\) connected to v. Note that a multiset, denoted by ‘[, ]’, is a set where duplicate elements are allowed. For example, in the graph G shown on the right side of Fig. 5a, the hierarchical label of the vertex \(v_1\) on level \(h = 3\) is \(L_G(v_1,3) = ((A, [B]), [(B, [A,C])])\). In this case, \(F^{(3)} = (A, [B])\), which is equal to \(L_G(v_1,2)\), and \(S^{(3)}_1 = (B, [A,C])\), which is equal to \(L_G(v_2,2)\). The original label A can also be regarded as a hierarchical label \((A,[\,])\) on the level \(h = 1\), but it is shown as ‘A’ for simplicity.
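The relabeling procedure can be sketched with nested tuples. In the sketch below, a level-1 label is stored as `(label, ())`, and multisets are stored as sorted tuples so that equal multisets compare equal. The toy path graph A-B-C is an assumption chosen to be consistent with the example label \(L_G(v_1,3)\) in the text; it is not necessarily the graph of Fig. 5a, and the function name is hypothetical.

```python
def hierarchical_labels(vertex_labels, adjacency, T):
    """Return {vertex: level-T hierarchical label}.

    vertex_labels: {vertex: original label}
    adjacency:     {vertex: list of neighboring vertices}
    A label is a pair (F, S): F is the previous-level label of the vertex
    (the original label at level 1) and S is the multiset of previous-level
    labels of its neighbors, stored as a sorted tuple.
    """
    labels = {v: (lab, ()) for v, lab in vertex_labels.items()}  # level 1
    for _ in range(T - 1):
        labels = {
            v: (labels[v], tuple(sorted(labels[u] for u in adjacency[v])))
            for v in vertex_labels
        }
    return labels

# Toy path graph v1(A) - v2(B) - v3(C).
vertex_labels = {1: "A", 2: "B", 3: "C"}
adjacency = {1: [2], 2: [1, 3], 3: [2]}
level3 = hierarchical_labels(vertex_labels, adjacency, T=3)
print(level3[1])  # nested form of ((A, [B]), [(B, [A, C])])
```

Each level is computed from the previous one only, so the labels for all vertices are obtained in T - 1 passes over the graph.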
We define a relation of the inclusion ‘\(\sqsubseteq \)’ between two hierarchical labels \(L_{P}(u,h) = (F^{(h)}, S^{(h)} = [S^{(h)}_1, \ldots , S^{(h)}_m])\) and \(L_{G}(v,h) = (F^{\prime (h)}, S^{\prime (h)} = [S^{\prime (h)}_1, \ldots , S^{\prime (h)}_n])\), which originate from the two vertices u and v in graphs P and G, respectively. We say that \(L_{P}(u,h)\) is included in \(L_{G}(v,h)\) and denote it by
when the following recursive condition is satisfied:
where \(\sigma : [m] \rightarrow [n]\) is an injection from [m] to [n] (i.e., \(\sigma (i) \ne \sigma (j)\) when \(i \ne j\)), and \(\exists \sigma (\wedge _{i \in [m]} S^{(h)}_i\sqsubseteq S^{\prime (h)}_{\sigma (i)})\) indicates that there exists an injection \(\sigma \) that satisfies \(S^{(h)}_i\sqsubseteq S^{\prime (h)}_{\sigma (i)}\) for all \(i \in [m]\). The first condition (27a) is for the case of \(S^{(h)} = S^{\prime (h)} =[\,]\), which occurs at the first level \(h=1\), and in this case, it simply evaluates whether the two hierarchical labels are equal, i.e., \(F^{(h)} = F^{\prime (h)}\). Note that when \(h = 1\), the hierarchical label is simply (X, []), where X is one of the original vertex-labels. In the other case (27b), both of the two conditions \(F^{(h)} \sqsubseteq F^{\prime (h)}\) and \(\exists \sigma (\wedge _{i \in [m]} S^{(h)}_i\sqsubseteq S^{\prime (h)}_{\sigma (i)})\) are recursively defined. Suppose that we already evaluated the level \(h - 1\) relation \(L_{P}(u,h-1) \sqsubseteq L_{G}(v,h-1)\) for all pairs (u, v) from P and G. Because \(F^{(h)} = L_{P}(u,h-1)\) and \(F^{\prime (h)} = L_{G}(v,h-1)\), the condition \(F^{(h)} \sqsubseteq F^{\prime (h)}\) is equivalent to \(L_{P}(u,h-1) \sqsubseteq L_{G}(v,h-1)\), which is assumed to be already obtained in the level \(h-1\) computation. Because \(S^{(h)}_i\) and \(S^{\prime (h)}_i\) are also from hierarchical labels on level \(h - 1\), the condition \(\exists \sigma (\wedge _{i \in [m]} S^{(h)}_i \sqsubseteq S^{\prime (h)}_{\sigma (i)})\) is also recursive. From the result of the level \(h - 1\) evaluations, we can determine whether \(S^{(h)}_i \sqsubseteq S^{\prime (h)}_{j}\) holds for all (i, j). Then, the evaluation of the condition \(\exists \sigma (\wedge _{i \in [m]} S^{(h)}_i \sqsubseteq S^{\prime (h)}_{\sigma (i)})\) is reduced to a matching problem from \(i \in [m]\) to \(j \in [n]\). 
This problem can be simply transformed into a maximum bipartite matching problem for a pair of \(\{ S^{(h)}_1, \ldots , S^{(h)}_m \}\) and \(\{ S^{\prime (h)}_1,\ldots ,S^{\prime (h)}_n \}\), where edges exist on a set of pairs \(\{ (i,j) \mid S^{(h)}_i\sqsubseteq S^{\prime (h)}_j \}\). When the size of the maximum matching is equal to m, there exists an injection \(\sigma \) satisfying \(\wedge _{i \in [m]} S^{(h)}_i\sqsubseteq S^{\prime (h)}_{\sigma (i)}\). It is well known that the maximum bipartite matching can be reduced to the maximum flow problem, which can be solved in polynomial time (Goldberg and Tarjan 1988). An example of the inclusion relationship is shown in Fig. 5b.
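The recursive inclusion test and its matching step can be sketched as follows. Labels are nested pairs `(F, S)` with `S` a tuple and level-1 labels of the form `(label, ())`; both arguments are assumed to be labels of the same level. The sketch recurses on the labels directly instead of tabulating all vertex pairs level by level, and it uses a simple augmenting-path bipartite matching in place of a maximum-flow solver; both are simplifications, and the function names are hypothetical.

```python
def included(lp, lg):
    """Relaxed inclusion test between two same-level hierarchical labels."""
    f_p, s_p = lp
    f_g, s_g = lg
    if not s_p and not s_g:          # base case: no neighbor multisets
        return f_p == f_g
    if len(s_p) > len(s_g):          # no injection [m] -> [n] when m > n
        return False
    return included(f_p, f_g) and has_injection(s_p, s_g)

def has_injection(s_p, s_g):
    """True if every element of s_p can be matched to a distinct element
    of s_g that includes it (augmenting-path bipartite matching)."""
    adj = [[j for j, g in enumerate(s_g) if included(p, g)] for p in s_p]
    match_g = [-1] * len(s_g)        # match_g[j] = index in s_p matched to j

    def try_assign(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_g[j] == -1 or try_assign(match_g[j], seen):
                match_g[j] = i
                return True
        return False

    return all(try_assign(i, set()) for i in range(len(s_p)))

A, B, C = ("A", ()), ("B", ()), ("C", ())
print(included((A, (B,)), (A, (B, C))))    # one B neighbor fits into [B, C]
print(included((A, (B, B)), (A, (B, C))))  # two B neighbors need two distinct matches
```

The second call fails because only one element of the target multiset can host a B, which is exactly the situation the injection requirement is meant to detect.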
Let \(|P|\) and \(|G|\) be the numbers of vertices in P and G, respectively. Then, multisets of the level T hierarchical labels of all the vertices in P and G are written as \([L_{P}(u_i,T)]_{i \in [|P|]} :=[L_{P}(u_1,T), L_{P}(u_2,T), \ldots , L_{P}(u_{|P|},T)]\) and \([L_{G}(v_i,T)]_{i \in [|G|]} :=[L_{G}(v_1,T), L_{G}(v_2,T), \ldots , L_{G}(v_{|G|},T)]\), respectively. For a feature of a given input graph G, we define the approximate subgraph isomorphism feature (ASIF) as follows:
This feature approximately evaluates the existence of a subgraph P in G using the level T hierarchical labels. ASIF satisfies the monotone decreasing property (12), i.e., \(x_{P'\sqsubseteq G}\le x_{P\sqsubseteq G}\) if \(P' \sqsupseteq P\), because the number of conditions in (27) only increases when P grows.
To incorporate label dissimilarity information (as shown in Fig. 4) into ASIF, we first extend the label inclusion relation (26) by using the concept of optimal transportation cost. As a label similarity-based relaxed evaluation of \(L_{P}(u,h) \sqsubseteq L_{G}(v,h)\), we define an asymmetric cost between \(L_P(u,h)\) and \(L_G(v,h)\) as follows
where the second term of (29b) is
which we refer to as the label transportation cost (LTC) representing the optimal transportation from the multiset \(S^{(h)}\) to another multiset \(S^{\prime (h)}\) among the set of all injections \({{\mathcal {I}}}:=\{ \sigma : [m] \rightarrow [n] \mid \sigma (i) \ne \sigma (j) \text { for } i \ne j \}\). The equation (29) has a recursive structure similar to that of (26). The first case (29a) occurs when \(S^{(h)} = S^{\prime (h)} =[\,]\), which is at the first level \(h = 1\). In this case, \(\mathrm {cost}_1\) is defined by \(\mathrm {dissimilarity}(F^{(1)},F^{\prime (1)})\), which is directly obtained as a dissimilarity between original labels since \(F^{(1)}\) and \(F^{\prime (1)}\) stem from the original vertex-labels. In the other case (29b), the cost is recursively defined as the sum of the cost from \(F^{(h)}\) to \(F^{\prime (h)}\) and the optimal-transport cost from \(S^{(h)}\) to \(S^{\prime (h)}\). Although this definition is recursive, as in the case of ASIF, the evaluation can be performed by computing sequentially from \(h = 1\) to \(h = T\). Because \(F^{(h)} = L_{P}(u,h-1)\) and \(F^{\prime (h)} = L_{G}(v,h-1)\), the first term \(\mathrm {cost}_{h-1}(F^{(h)} \rightarrow F^{\prime (h)})\) represents the cost between hierarchical labels on the level \(h-1\), which is assumed to already have been obtained. The second term \(\mathrm {LTC}(S^{(h)} \rightarrow S^{\prime (h)}, \mathrm {cost}_{h-1})\) evaluates the best match between \([S^{(h)}_1, \ldots , S^{(h)}_m]\) and \([S^{\prime (h)}_1, \ldots , S^{\prime (h)}_n]\), as defined in (30). This matching problem can be seen as an optimal transportation problem, which minimizes the cost of the transportation of m items to n warehouses under the given cost matrix specified by \(\mathrm {cost}_{h-1}\). The values of \(\mathrm {cost}_{h-1}\) for all the pairs in [m] and [n] are also available from the computation at the level \(h - 1\). 
For the given cost values, the problem of \(\mathrm {LTC}(S^{(h)} \rightarrow S^{\prime (h)}, \mathrm {cost}_{h-1})\) can be reduced to a minimum-cost-flow problem on a bipartite graph with a weight \(\mathrm {cost}_{h-1}(S^{(h)}_i \rightarrow S^{\prime (h)}_j)\) between \(S^{(h)}_i\) and \(S^{\prime (h)}_j\), which can be solved in polynomial time (Goldberg and Tarjan 1988).
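For small multisets, the LTC can be illustrated by enumerating all injections explicitly. The sketch below is a brute-force stand-in for the minimum-cost-flow computation mentioned in the text, so it is only practical for tiny m and n; the function name `ltc` and the convention of returning infinity when no injection exists are assumptions.

```python
from itertools import permutations
import math

def ltc(cost):
    """Minimum total cost over all injections sigma: [m] -> [n].

    cost[i][j] is the (already computed) level h-1 cost of transporting
    the i-th source label to the j-th target label. Returns math.inf when
    no injection exists (m > n).
    """
    m = len(cost)
    n = len(cost[0]) if m else 0
    if m > n:
        return math.inf
    best = math.inf
    for sigma in permutations(range(n), m):   # sigma[i] = target assigned to source i
        total = sum(cost[i][sigma[i]] for i in range(m))
        best = min(best, total)
    return best

# Toy cost matrix with m = 2 source labels and n = 3 target labels.
print(ltc([[1.0, 5.0, 2.0],
           [4.0, 1.0, 3.0]]))
```

Entries equal to infinity behave like missing edges: any injection that uses them has infinite total cost and is never selected if a finite alternative exists.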
We define an asymmetric transport cost for two graphs P and G, which we call the graph transportation cost (GTC), as LTC from all level T hierarchical labels of P to those of G:
Then, as a feature of the input graph G, we define the following simASIF:
where \(\rho > 0\) is a hyperparameter. This simASIF can be regarded as a generalization of (28) based on the vertex-label similarity. When \(\mathrm {dissimilarity}(F^{(1)}, F^{\prime (1)}) :=\infty \times 1_{F^{(1)} \ne F^{\prime (1)}}\), the feature (31) is equivalent to (28). Similarly to ASIF, \(\mathrm {GTC}(P\rightarrow G)\) satisfies the monotonicity property
because the number of vertices to transport increases as P grows. Therefore, simASIF (31) satisfies the monotonicity property, i.e., \(x_{P'\rightarrow G}\le x_{P\rightarrow G}\) if \(P'\sqsupseteq P\).
From the definition (31), simASIF always has a positive value \(x_{P \rightarrow G} > 0\) except when \(\mathrm {GTC}(P \rightarrow G) = \infty \), which may not be suitable for identifying a small number of important subgraphs. Further, in simASIF, the bipartite graph in the minimum-cost-flow calculation \(\mathrm {LTC}(S \rightarrow S', \mathrm {cost}_{h-1})\) is always a complete bipartite graph, where all vertices in S are connected to all vertices in \(S'\). Because the efficiency of most standard minimum-cost-flow algorithms depends on the number of edges, this may entail a large computational cost. As an extension to mitigate these issues, a threshold can be introduced into simASIF as follows:
where \(t > 0\) is a threshold parameter. In this definition, \(x = 0\) when \(\exp \{-\rho \,\mathrm {GTC}(P \rightarrow G)\}\le t\), i.e., \(\mathrm {GTC}(P \rightarrow G) \ge -(\log t)/\rho \). This indicates that if a cost is larger than \(-(\log t)/\rho \), we can regard the cost as \(\infty \). Therefore, at any h, if the cost between \(S^{(h)}_i\) and \(S^{\prime (h)}_j\) is larger than \(-(\log t)/\rho \), the edge between \(S^{(h)}_i\) and \(S^{\prime (h)}_j\) is not necessary. Then, the number of matching pairs can be less than m in \(\mathrm {LTC}(\cdot )\) because of the lack of edges, and in this case, the cost is regarded as \(\infty \). Furthermore, if \(\mathrm {cost}_h (F^{(h)} \rightarrow F^{\prime (h)})\) is larger than \(-(\log t)/\rho \) in (29b), the computation of \(\mathrm {LTC} (S^{(h)} \rightarrow S^{\prime (h)}, \mathrm {cost}_{h-1})\) is not required because \(x=0\) is determined.
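The effect of the threshold can be checked numerically. The sketch below assumes the feature has the form \(\exp \{-\rho \,\mathrm {GTC}\}\), which is the form consistent with the monotonicity property and with the feature vanishing at infinite cost; the function names and the toy parameter values are illustrative.

```python
import math

def sim_asif(gtc, rho, t=0.0):
    """Thresholded similarity feature: exp(-rho * gtc) if it exceeds t, else 0."""
    x = math.exp(-rho * gtc)
    return x if x > t else 0.0

def cost_cutoff(rho, t):
    """Cost level -(log t) / rho at or above which the feature is set to zero."""
    return -math.log(t) / rho

rho, t = 2.0, 0.5
print(cost_cutoff(rho, t))      # costs at or above this value give x = 0
print(sim_asif(0.3, rho, t))    # below the cutoff: positive feature value
print(sim_asif(0.4, rho, t))    # above the cutoff: feature is zero
print(sim_asif(math.inf, rho))  # infinite cost gives zero even without a threshold
```

Any partial cost that already exceeds the cutoff allows the remaining computation to be skipped, since the total cost can only grow.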
Note that transportation-based graph metrics have been studied (e.g., Titouan et al. 2019), but the purpose of such studies was to evaluate the similarity between two graphs (not inclusion). Our (sim)ASIF provides a feature with the monotonicity property as a natural relaxation of subgraph isomorphism, by which the optimality of our pruning strategy can be guaranteed. In contrast, there have been many studies on inexact graph matching (Yan et al. 2016) such as eigenvector (Leordeanu et al. 2012; Kang et al. 2013), edit distance (Gao et al. 2010), and random-walk-based (Gori et al. 2005; Cho et al. 2010) methods. Some of these methods provide a score for the matching, which can be seen as a similarity score between a searched graph pattern and a matched graph. However, they do not guarantee the monotonicity of the similarity score for pattern growth. If the similarity score satisfies monotonicity, it can be combined with IGML. Although we only consider vertex-labels, edge-labels can also be incorporated into (sim)ASIF. A simple approach is to transform a labeled edge into a labeled node with two unlabeled edges, such that (sim)ASIF is directly applicable.
Experiments
We evaluate the performance of IGML using the benchmark datasets shown in Table 3. These datasets are available from Kersting et al. (2016). We did not use edge labels because the implementations of the compared methods cannot deal with them, and the largest connected component was used if a graph was not connected. Note that IGML currently cannot directly deal with continuous attributes, so we did not use them. A possible approach would be to perform discretization or quantization before the optimization, such as taking grid points or applying clustering in the attribute space. Building a more elaborate approach, such as dynamically determining the discretization, is a possible future direction. The #maxvertices column in the table indicates the size (number of vertices) of the largest subgraph considered in IGML. To fully identify important subgraphs, a large value of #maxvertices is preferred, but this can cause a correspondingly large memory requirement to store the gSpan tree. For each dataset, we set the largest value for which IGML could finish with a tractable amount of memory. The sets \({\mathcal {S}}_i\) and \({\mathcal {D}}_i\) were selected as the ten nearest neighbors of \({\varvec{x}}_i\) (\(K=|{\mathcal {S}}_i|=|{\mathcal {D}}_i|=10\)) by using the WL kernel. A sequence of the regularization coefficients was created by equally spacing 100 grid points on a logarithmic scale between \(\lambda _{\max }\) and \(0.01\lambda _{\max }\). We set the minimum support in gSpan to 0, meaning that all the subgraphs in a given dataset were enumerated as long as they satisfied the #maxvertices constraint. The gSpan tree is mainly traversed at the beginning of each \(\lambda \), as shown in Algorithm 1 (in the case of WS-based approaches, the tree is also traversed at every working set update). Note that the tree is dynamically constructed during this traversal without constructing the entire tree beforehand.
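The regularization path described above can be generated in a few lines. This is a minimal sketch of a log-spaced grid; the function name `lambda_path` and the value of `lam_max` in the example are hypothetical, since \(\lambda _{\max }\) itself is computed from the data.

```python
def lambda_path(lam_max, n_points=100, ratio=0.01):
    """Decreasing grid, equally spaced on a log scale,
    from lam_max down to ratio * lam_max."""
    return [lam_max * ratio ** (i / (n_points - 1)) for i in range(n_points)]


# Hypothetical lam_max; consecutive grid points share a constant ratio.
path = lambda_path(2.0)
```

Because the grid is geometric, each \(\lambda _{i+1}/\lambda _i\) equals \(0.01^{1/99}\), which is about 0.955.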
In the working-set method, after convergence, it is necessary to traverse the tree again in order to confirm the overall optimality. If the termination condition is not satisfied, optimization with a new working set must be performed. The termination condition for the optimization is that the relative duality gap is less than \(10^{-6}\). In the experiment, we used \(g(x) = 1_{x > 0}\) in \(\phi _H(G)\) with Lemma 2 unless otherwise noted. The dataset was randomly divided into training, validation, and test sets with the ratio \(\mathrm {train}:\mathrm {validation}:\mathrm {test} = 0.6:0.2:0.2\), and our experimental results were averaged over 10 runs.
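The random partitioning can be sketched as follows. This is only an illustration under stated assumptions: a plain (non-stratified) shuffle is used, and the function name `split_indices`, the dataset size, and the seeds are hypothetical rather than taken from the experimental code.

```python
import random


def split_indices(n, seed, fractions=(0.6, 0.2, 0.2)):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Ten independent random splits of a hypothetical dataset with 100 graphs.
splits = [split_indices(100, seed) for seed in range(10)]
```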
Evaluating computational efficiency
In this section, we confirm the effect of the proposed pruning methods. We evaluated four settings: Safe Screening and Pruning: “SSP”, Range-based Safe Screening and Pruning: “RSSP”, Working set Selection and Pruning: “WSP”, and the combination of WSP and RSSP: “WSP+RSSP”. Each method performed dynamic screening with DGB at every update of \({\varvec{m}}\). We here used the AIDS dataset, where #maxvertices=30. In this dataset, when we fully traversed the gSpan tree without safe screening or working set selection, the number of tree nodes exceeded \(9.126 \times 10^7\), at which point our implementation with gSpan stopped because we ran out of memory.
Figure 6a shows the size of \({{\mathcal {F}}}\) after the first traversal at each \(\lambda \), and the number of nonzero \(m_k\) after the optimization is also shown as a baseline. We first observe that both approaches significantly reduced the number of features. Even in the largest case, where approximately 200 features were finally selected by \(m_k\), fewer than 1000 features remained. We observe that WSP exhibited significantly smaller values than SSP. In exchange, WSP may need to perform the entire tree search again because it cannot guarantee the sufficiency of the current \({{\mathcal {F}}}\), while SSP does not need to search the tree again because it guarantees that \({{\mathcal {F}}}\) must contain all \(m_k \ne 0\).
The number of visited nodes in the first traversal at each \(\lambda \) is shown in Fig. 6b. Here, we added RSSP and WSP+RSSP, which are not shown in Fig. 6a. Note that the #remaining dimensions is the same for SSP and RSSP, and for WSP and WSP+RSSP. Because RSSP is derived from SSP, it does not change the number of screened features. As we discussed in Sect. 4.2.2, WSP removes more features than RSSP, though it is not safe. We observed that the #visited nodes of SSP was the largest, but it was less than approximately 27000 (\(27000 / (9.126 \times 10^7) \approx 0.0003\)). Comparing SSP and WSP, we see that WSP pruned a larger number of nodes. In contrast, the #visited nodes of RSSP was less than 6000. The difference between SSP and RSSP indicates that a larger number of nodes can be skipped by the range-based method. Therefore, by combining the node skip by RSSP with the stronger pruning of WSP, the #visited nodes was further reduced. RSSP and WSP+RSSP had larger values at \(\lambda _0\) than at the subsequent \(\lambda _i\). This is because of the effect of range-based screening and pruning. At \(\lambda _0\), every visited node in the tree calculates the ranges in which the screening and pruning rules are satisfied (i.e., RSS and RSP rules), and as a result, some nodes can be skipped while \(\lambda _i\) is in those ranges. At every \(\lambda _i\) for \(i > 0\), the ranges are updated only in the (non-skipped) visited nodes, and thus the range-based rules take effect at every \(\lambda _i\) except \(\lambda _0\).
The total time of the pathwise optimization is shown in Table 2. RSSP and WSP+RSSP were fast with regard to the traversing time, and WSP and WSP+RSSP were fast with regard to the solving time. Note that because the tree is dynamically constructed during the traversal, the ‘Traverse’ time includes the time spent on the tree construction. In total, WSP+RSSP was the fastest. These results indicate that our method took only approximately 1 minute to solve the optimization problem with more than \(9.126 \times 10^7\) variables. We also show the computational cost evaluation for other datasets in Appendix I.
Although we have confirmed that IGML works efficiently on several benchmark datasets, completely elucidating the general complexity of IGML remains future work. The practical complexity depends at least on the graph size in the training data, #maxvertices, #samples, and the pruning rate. In terms of the graph size, traversing a large graph dataset using gSpan can be intractable because it requires all matched subgraphs to be maintained at each tree node. Therefore, applying IGML to large graphs, e.g., graphs with thousands of nodes or more, would be difficult. Meanwhile, the scalability of IGML depends not only on the sizes of graphs but also strongly on the performance of the pruning. However, we still do not have any general analytic complexity evaluation for the rate of the pruning that avoids exponential worst-case computations. In fact, we observed that there exist datasets in which the efficiency of the pruning is not sufficient. For example, on the IMDB-BINARY and IMDB-MULTI datasets, which are also from Kersting et al. (2016), a large number of small subgraphs are shared across all the different classes and instances (i.e., \(x_{i,k} = 1\) for all i). Our upper bound in the pruning is based on the fact that \(x_{i,k'} \le x_{i,k}\) for a descendant node \(k'\) in the mining tree. This bound becomes tighter when \(x_{i,k} = 0\) for many i because 0 is the lower bound of \(x_{i,k}\). In contrast, when many instances have \(x_{i,k} = 1\), the bound can be loose, making traversal intractable. This is an important open problem common to predictive mining methods (Nakagawa et al. 2016; Morvan and Vert 2018).
Predictive accuracy comparison
In this section, we compare the prediction accuracy of IGML with those of the Graphlet Kernel (GK) (Shervashidze et al. 2009), Shortest-Path Kernel (SPK) (Borgwardt and Kriegel 2005), Random-Walk Kernel (RW) (Vishwanathan et al. 2010), Weisfeiler-Lehman Kernel (WL) (Shervashidze et al. 2011), and Deep Graph Convolutional Neural Network (DGCNN) (Zhang et al. 2018a). We used the implementations available at the URLs in footnote 1. Note that we mainly compared methods for obtaining a metric between graphs. The graph kernel approach is one of the most important existing approaches to defining a metric space of non-vector structured data. Although kernel functions are constructed in an unsupervised manner, their high prediction performance has been widely shown. In particular, the WL kernel is known for its comparable classification performance to recent graph neural networks (e.g., Niepert et al. 2016; Morris et al. 2019). Meanwhile, DGCNN can provide a vector representation of an input graph by using the outputs of some middle layer, which can be interpreted as obtaining a metric space through supervised learning. We did not compare with (Saigo et al. 2009; Nakagawa et al. 2016; Morvan and Vert 2018) as they only focused on specific linear prediction models rather than building a general discriminative space. We employed the k-nearest neighbors (k-nn) classifier to directly evaluate the discriminative ability of the feature spaces constructed by IGML and each kernel function. A graph kernel can be seen as an inner product \(k(G_i,G_j) = \langle \varphi (G_i), \varphi (G_j) \rangle \), where \(\varphi \) is a mapping from a graph to a reproducing kernel Hilbert space. Then, the distance can be written as \(\Vert \varphi (G_i) - \varphi (G_j) \Vert = \sqrt{k(G_i,G_i) - 2 k(G_i,G_j) + k(G_j,G_j)}\).
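The kernel-induced distance and the k-nn evaluation can be sketched as follows. This is a minimal illustration, assuming a precomputed Gram matrix given as a nested list; the function names and the toy Gram matrix (built from explicit two-dimensional vectors so that the distances can be verified by hand) are hypothetical.

```python
import math
from collections import Counter


def kernel_distance(K, i, j):
    """Distance between items i and j in the feature space induced by K."""
    sq = K[i][i] - 2.0 * K[i][j] + K[j][j]
    return math.sqrt(max(sq, 0.0))  # clip tiny negatives from round-off


def knn_predict(K, labels, train_idx, query, k):
    """Majority vote among the k training items nearest to `query`."""
    nearest = sorted(train_idx, key=lambda i: kernel_distance(K, query, i))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy Gram matrix from explicit vectors; index 3 plays the role of a test graph.
feats = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0), (5.0, 3.0)]
K = [[a[0] * b[0] + a[1] * b[1] for b in feats] for a in feats]
labels = [0, 0, 1, None]
```

For instance, `knn_predict(K, labels, [0, 1, 2], 3, 1)` returns the label of the single nearest training item, whereas a larger k may change the vote.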
The values of k for the k-nn were \(k=1, 3, 5, 7, ..., 49\), and the hyperparameters of each method were selected using the validation data; the prediction accuracy was then evaluated on the test data. The graphlet size for GK was set up to 6. The parameter \(\lambda _{\mathrm{RW}}\) for RW was set to the recommended \(\lambda _{\mathrm{RW}}=\max _{i\in {\mathbb {Z}}: 10^i<1/d^2}10^i\), where d denotes the maximum degree. The loop parameter h of WL was selected from 0, 1, 2, ..., 10 by using the validation data. For DGCNN, the number of hidden units and the sort-pooling setting were also selected using the validation data, from 64, 128, 256 and from \(40\%, 60\%, 80\%\), respectively.
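The rule for \(\lambda _{\mathrm{RW}}\) selects the largest power of ten strictly below \(1/d^2\). A minimal sketch is given below; the function name is hypothetical, and integer arithmetic is used for the comparison to avoid floating-point issues at exact powers of ten.

```python
def recommended_lambda_rw(max_degree):
    """Largest power of ten strictly below 1 / max_degree**2.

    Assumes max_degree is a positive integer.
    """
    exponent = 0
    # 10**(-exponent) < 1/d^2  is equivalent to  10**exponent > d^2.
    while 10 ** exponent <= max_degree ** 2:
        exponent += 1
    return 10.0 ** (-exponent)
```

For example, a maximum degree of 3 gives \(1/d^2 \approx 0.111\), so the rule selects \(10^{-1}\); a maximum degree of 10 gives \(1/d^2 = 0.01\), and because the inequality is strict, the rule selects \(10^{-3}\).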
The micro-F1 score for each dataset is shown in Table 3. “IGML (Diag)” indicates IGML with the weighted squared distance (1), and “IGML (Diag\(\rightarrow \)Full)” indicates that with post-processing using the Mahalanobis distance (22). “IGML (Diag)” yielded the best score or a score comparable to the best on seven out of nine datasets. This result is impressive because IGML uses a much simpler metric than the other methods. Among the seven datasets, “IGML (Diag\(\rightarrow \)Full)” slightly improved the mean accuracy on four datasets, but the difference was not significant. This may suggest that the diagonal weighting can provide sufficient performance in many practical settings. The WL kernel also exhibited superior performance, showing the best or comparable to the best accuracy on six datasets. DGCNN showed high accuracy on the DBLP_v1 dataset, which has a large number of samples, while its accuracy was low for the other datasets.
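For reference, the micro-averaged F1 score pools true positives, false positives, and false negatives over all classes before computing F1. The sketch below uses this standard definition with hypothetical labels; in single-label classification every misclassified instance contributes one false positive and one false negative, so the score coincides with accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all classes (standard definition)."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical labels: three of the five predictions are correct.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
score = micro_f1(y_true, y_pred)
```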
Illustrative examples of selected subgraphs
Figure 7 shows an illustrative example of IGML on the Mutagenicity dataset, where mutagenicity was predicted from a graph representation of molecules. Figure 7a is a graphical representation of subgraphs, each of which has a weight shown in (b). For example, we can clearly see that subgraph #2 is estimated as an important substructure to discriminate different classes. Figure 7c shows a heatmap of the transformation matrix \(\sqrt{{\varvec{\Lambda }}}{\varvec{V}}^\top \) optimized for the thirteen features, containing three nonzero eigenvalues. For example, we see that subgraphs #10 and #12 have similar columns in the heatmap. This indicates that these two similar subgraphs (#10 contains #12) are shrunk to almost the same representation by the regularization term \(R(\varvec{M})\).
As another example of graph data analysis on the learned representation, we applied the decision tree algorithm to the obtained feature (23) on the Mutagenicity dataset. Although there has been a study constructing a decision tree directly for graph data (Nguyen et al. 2006), it requires a severe restriction on the patterns to be considered for computational feasibility. In contrast, because (23) is a simple vector representation with a reasonable dimension, it is quite easy to apply the decision tree algorithm. We selected two paths from the obtained decision tree as shown in Fig. 8. For example, in the path (a), if a given graph contains “\({\varvec{O}}= {\varvec{N}}\)”, and does not contain “\({\varvec{H}} {\varvec{O}} {\varvec{C}} {\varvec{C}}= {\varvec{C}} {\varvec{C}} {\varvec{H}}\)”, and contains “\({\varvec{N}} {\varvec{C}}= {\varvec{C}} {\varvec{C}}= {\varvec{C}} < \begin{array}{l}{\varvec{C}}\\ {\varvec{C}}\end{array}\)”, the given graph is predicted as \(y=0\) with probability 140/146. Both rules clearly separate the two classes, which is highly insightful as we can trace the process of the decision based on the subgraphs.
Experiments for three extensions
In this section, we evaluate the performance of the three extensions of IGML described in Sect. 6.
First, we evaluated the performance of IGML on itemset and sequence data using the benchmark datasets shown in the first two rows of Tables 4 and 5. These datasets can be obtained from (Dua and Graff 2017) and (Chang and Lin 2011), respectively. We set the maximum pattern size considered by IGML to 30. Table 4 lists the micro-F1 scores on the itemset datasets. We used k-nn with the Jaccard similarity as a baseline, where k was selected using the validation set, as described in Sect. 7.2. The scores of both IGML (Diag) and IGML (Diag\(\rightarrow \)Full) were superior to those of the Jaccard similarity on all datasets. Table 5 lists the micro-F1 scores on the sequence datasets. Although IGML (Diag) did not outperform the mismatch kernel (Leslie et al. 2004) for the promoters dataset, IGML (Diag\(\rightarrow \)Full) achieved a higher F1-score than the kernel on all datasets. Figure 9 shows an illustrative example of sequences identified by IGML on the promoters dataset, where the task was to predict whether an input DNA sequence stems from a promoter region. Figure 9a is a graphical representation of the sequences, and the corresponding weights are shown in (b). For example, the subsequence #1 in (a) can be considered an important subsequence to discriminate different classes.
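The itemset baseline can be sketched as a k-nn classifier that ranks training itemsets by Jaccard similarity. The following is a minimal illustration; the function names and the toy itemsets are hypothetical, and treating two empty sets as fully similar is a convention chosen here only to avoid division by zero.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two itemsets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def knn_jaccard(train_sets, train_labels, query, k):
    """Majority vote among the k training itemsets most similar to `query`."""
    order = sorted(range(len(train_sets)),
                   key=lambda i: jaccard(train_sets[i], query),
                   reverse=True)
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data.
train_sets = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "z"}]
train_labels = ["pos", "pos", "neg", "neg"]
prediction = knn_jaccard(train_sets, train_labels, {"a", "c"}, k=3)
```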
Second, we show the results of the triplet formulation described in Sect. 6.2. To create the triplet set \({\mathcal {T}}\), we followed the approach in Shen et al. (2014), where k neighbors in the same class \({\varvec{x}}_j\) and k neighbors in different classes \({\varvec{x}}_l\) were sampled for each \({\varvec{x}}_i\) (\(k=4\)). Here, IGML with the pairwise loss is referred to as ‘IGML (Pairwise)’, and IGML with the triplet loss is referred to as ‘IGML (Triplet)’. Table 6 compares the micro-F1 scores of IGML (Pairwise) and IGML (Triplet). IGML (Triplet) showed higher F1-scores than IGML (Pairwise) on three of nine datasets, but it was not computable on two datasets because it ran out of memory (OOM). This is because the pruning rule in the triplet case (25) was looser than in the pairwise case.
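The construction of the triplet set can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure of Shen et al. (2014): neighbors are taken from a precomputed distance matrix, and every same-class neighbor is paired with every different-class neighbor, giving up to \(k^2\) triplets per instance. The function name and the toy data are hypothetical.

```python
def build_triplets(D, labels, k=4):
    """Return triplets (i, j, l): j is one of the k nearest same-class
    neighbors of i, and l is one of the k nearest different-class neighbors."""
    n = len(labels)
    triplets = []
    for i in range(n):
        by_dist = sorted((j for j in range(n) if j != i), key=lambda j: D[i][j])
        same = [j for j in by_dist if labels[j] == labels[i]][:k]
        diff = [l for l in by_dist if labels[l] != labels[i]][:k]
        triplets.extend((i, j, l) for j in same for l in diff)
    return triplets


# Hypothetical one-dimensional toy data with two well-separated classes.
pos = [0, 1, 2, 10, 11, 12]
toy_labels = [0, 0, 0, 1, 1, 1]
D = [[abs(a - b) for b in pos] for a in pos]
triplets = build_triplets(D, toy_labels, k=2)
```

Under this pairing, the number of triplets grows as \(n k^2\), which is consistent with the larger memory footprint observed for the triplet formulation.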
Finally, we evaluated the simASIF (32). We set the scaling factor of the exponential function as \(\rho =1\), the threshold of the feature as \(t=0.7\), and the number of relabeling steps as \(T=3\). We employed a simple heuristic approach to create a dissimilarity matrix among vertex-labels using labeled graphs in the given dataset. Suppose that the set of possible vertex-labels is \({{\mathcal {L}}}\), and \(f(\ell ,\ell ')\) is the frequency with which \(\ell \in {{\mathcal {L}}}\) and \(\ell ' \in {{\mathcal {L}}}\) are adjacent in all graphs of the dataset. By concatenating \(f(\ell ,\ell ')\) for all \(\ell ' \in {{\mathcal {L}}}\), we obtained a vector representation of a label \(\ell \). We normalized this vector representation such that the vector had the unit L2 norm. By calculating the Euclidean distance between these normalized representations, we obtained the dissimilarity matrix of vertex-labels. We are particularly interested in the case where the distribution of the vertex-label frequency is largely different between the training and test datasets, because in this case the exact matching of IGML may not be suitable to provide a prediction. We synthetically emulated this setting by splitting the training and test datasets using a clustering algorithm. Each input graph was transformed into a vector created by the frequencies of each vertex-label \(\ell \in {{\mathcal {L}}}\) contained in that graph. Subsequently, we applied k-means clustering to split the dataset into two clusters, for which \({{{\mathcal {C}}}}_1\) and \({{{\mathcal {C}}}}_2\) denote the sets of assigned data points, respectively. We used \({{\mathcal {C}}}_1\) for the training and validation datasets and \({{\mathcal {C}}}_2\) for the test dataset, where \(|{{\mathcal {C}}}_1| \ge |{{\mathcal {C}}}_2|\).
Following the same partitioning policy as in the above experiments, the size of the validation data was set to the size of \({{\mathcal {C}}}_2\), so that the size of the training set was \(|{{\mathcal {C}}}_1| - |{{\mathcal {C}}}_2|\). Table 7 lists the comparison of the micro-F1 scores on the AIDS, Mutagenicity, and NCI1 datasets. We did not consider other datasets as their training set sizes created from the above procedure were too small. We fixed the #maxvertices of simASIF to 8, which was less than the value in our original IGML evaluation (Table 3), because simASIF takes more time than the feature without vertex-label similarity. For the original IGML, we show the result for the setting in Table 3 and the result with #maxvertices 8. IGML with simASIF was superior to the original IGML for both #maxvertices settings on the AIDS and NCI1 datasets, although it has a smaller #maxvertices setting, as shown in Table 7. On the Mutagenicity dataset, simASIF was inferior to the original IGML reported in Table 3, but in the comparison under the same #maxvertices value, their performances were comparable. These results suggest that when the exact matching of the subgraph is not appropriate, simASIF can improve the prediction performance of IGML.
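The heuristic for building the vertex-label dissimilarity matrix can be sketched as follows. This is a minimal illustration: the graph encoding (a list of vertex labels plus a list of undirected edges), the function name, and the convention of counting an edge between two identically labeled vertices once are assumptions made here, not details taken from the experimental code.

```python
import math


def label_dissimilarity(graphs, label_set):
    """Euclidean distances between unit-normalized adjacency-frequency
    vectors of vertex labels.

    graphs: list of (vertex_labels, edges), edges as pairs of vertex indices.
    Returns (index, dis), where index maps a label to its row in dis.
    """
    index = {lab: a for a, lab in enumerate(sorted(label_set))}
    m = len(index)
    freq = [[0.0] * m for _ in range(m)]
    for vertex_labels, edges in graphs:
        for u, v in edges:
            a, b = index[vertex_labels[u]], index[vertex_labels[v]]
            freq[a][b] += 1.0
            if a != b:
                freq[b][a] += 1.0
    unit = []
    for row in freq:
        norm = math.sqrt(sum(x * x for x in row))
        unit.append([x / norm for x in row] if norm > 0 else row[:])
    dis = [[math.dist(p, q) for q in unit] for p in unit]
    return index, dis


# Hypothetical toy dataset: C-C-O and C-N.
toy_graphs = [(["C", "C", "O"], [(0, 1), (1, 2)]), (["C", "N"], [(0, 1)])]
index, dis = label_dissimilarity(toy_graphs, {"C", "N", "O"})
```

In this toy dataset, N and O are both adjacent only to C, so their normalized vectors coincide and their dissimilarity is zero, which illustrates how labels that play similar structural roles are treated as similar.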
Performance on frequency feature
In this section, we evaluate IGML with \(g(x)=\log (1+x)\) instead of \(g(x)=1_{x>0}\). Note that because computing the frequency without overlapping \(\#(H\sqsubseteq G)\) is NP-complete (Schreiber and Schwöbbermeyer 2005), in addition to the exact count, we evaluated the feature defined by an upper bound of \(\#(H\sqsubseteq G)\) (see Appendix J for details). We employed \(\log \) because the scale of the frequency x is highly diversified. Based on the results in Sect. 7.1, we used WSP+RSSP in this section. The #maxvertices for each dataset followed those in Table 3.
The comparison of micro-F1 scores for the exact \(\#(H\sqsubseteq G)\) and the approximation of \(\#(H\sqsubseteq G)\) is shown in Table 8. The exact \(\#(H\sqsubseteq G)\) did not complete on five datasets, mainly due to the computational difficulty of the frequency counting. In contrast, the approximate \(\#(H\sqsubseteq G)\) completed on all datasets. Overall, for both the exact and approximate frequency features, the micro-F1 scores were comparable with the case of \(g(x)=1_{x>0}\) shown in Table 3.
Table 9 lists the total times for the pathwise optimization for the exact \(\#(H\sqsubseteq G)\) and the approximation of \(\#(H\sqsubseteq G)\). On the AIDS dataset, the exact \(\#(H\sqsubseteq G)\) did not complete within a day, while the traversal time using the approximate \(\#(H\sqsubseteq G)\) was only 8.6 sec. On the BZR dataset, the traversal time using the exact \(\#(H\sqsubseteq G)\) was seven times that using the approximate \(\#(H\sqsubseteq G)\). The solving time for the approximation was lower because \(|\mathcal {F}|\) after traversal with the approximation was significantly smaller than that with the exact \(\#(H\sqsubseteq G)\) in this case. Because the approximate \(\#(H\sqsubseteq G)\) is an upper bound of the exact \(\#(H\sqsubseteq G)\), the variation of the values of the exact \(\#(H\sqsubseteq G)\) was smaller than that of the approximate \(\#(H\sqsubseteq G)\). This resulted in higher correlations among the features created by the exact \(\#(H\sqsubseteq G)\). It is known that the elastic-net regularization tends to select correlated features simultaneously (Zou and Hastie 2005), and therefore, \(|{{\mathcal {F}}}|\) in the case of the exact \(\#(H\sqsubseteq G)\) becomes larger than in the approximate case.
Figure 10 shows the number of visited nodes, the size of the feature subset \({{\mathcal {F}}}\) after traversal, and the number of selected features on the AIDS dataset with the approximate \(\#(H\sqsubseteq G)\). This indicates that IGML keeps the number of subgraphs tractable even if \(g(x)=\log (1+x)\) is used as the feature. The #visited nodes was less than 3500, and \(|\mathcal {F}|\) after traversal was sufficiently close to \(|\{k\mid {\hat{m}}_k>0\}|\). We see that the #visited nodes at \(\lambda _0\) is larger than at many subsequent \(\lambda _i\), and this is the effect of the range-based rules, as in the case of Fig. 6b.
Conclusions
In this paper, we proposed an interpretable metric learning method for graph data, named interpretable graph metric learning (IGML). To avoid computational difficulty, we built an optimization algorithm that combines safe screening, working set selection, and their pruning extensions. We also discussed three extensions of IGML: (a) applications to other structured data, (b) a triplet-loss-based formulation, and (c) incorporating vertex-label similarity into the feature. We empirically evaluated the performance of IGML compared with existing graph classification methods. Although IGML was the only method with clear interpretability, it showed superior or comparable prediction performance compared to other state-of-the-art methods. The practicality of IGML was further demonstrated through some illustrative examples of identified subgraphs. Although IGML optimized the metric within tractable time in the experiments, the subgraphs were restricted to moderate sizes (up to 30), and a current major bottleneck for extracting larger subgraphs is the memory requirement of the gSpan tree. Therefore, mitigating this memory consumption is an important future direction to apply IGML to a wider class of problems.
Data Availability
All datasets used in the experiments are available online (see Sect. 7 for details).
Code Availability
The source code for the program used in the experiments is available at https://github.com/takeuchilab/LearningInterpretableMetricbetweenGraphs.
Notes
https://github.com/ysig/GraKeL for GK, http://mlcb.is.tuebingen.mpg.de/Mitarbeiter/Nino/Graphkernels/ for the other graph kernels, and https://github.com/muhanzhang/pytorch_DGCNN for DGCNN.
References
Adhikari, B., Zhang, Y., Ramakrishnan, N., & Prakash, B. A. (2018). Sub2vec: Feature learning for subgraphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 170–182). Springer.
Agrawal, R., Srikant, R., et al. (1994). Fast algorithms for mining association rules. In Proceedings of 20th international conference on very large data bases, VLDB (vol. 1215, pp. 487–499).
Atwood, J., & Towsley, D. (2016). Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1993–2001).
Bellet, A., Habrard, A., & Sebban, M. (2012). Good edit similarity learning by loss minimization. Machine Learning, 89(1–2), 5–35.
Borgwardt, K. M., & Kriegel, H.-P. (2005). Shortest-path kernels on graphs. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM 2005) (pp. 74–81). IEEE Computer Society.
Brinda, K., & Vishveshwara, S. (2005). A network representation of protein structures: Implications for protein stability. Biophysical Journal, 89(6), 4159–4170.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cheng, H., Yan, X., Han, J., & Philip, S. Y. (2008). Direct discriminative pattern mining for effective classification. In 2008 IEEE 24th International Conference on Data Engineering (pp. 169–178). IEEE.
Cho, M., Lee, J., & Lee, K. M. (2010). Reweighted random walks for graph matching. In European conference on Computer vision (pp. 492–505). Springer.
Costa, F., & Grave, K. D. (2010). Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 27th International Conference on International Conference on Machine Learning (pp. 255–262). Omnipress.
Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I. S. (2007). Information-theoretic metric learning. In Proceedings of the 24th international conference on Machine learning (pp. 209–216). ACM.
Dua, D., & Graff, C. (2017). UCI machine learning repository. http://archive.ics.uci.edu/ml.
Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., & Adams, R. P. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems (pp. 2224–2232). Curran Associates, Inc.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of machine learning research, 9, 1871–1874.
Feragen, A., Kasenburg, N., Petersen, J., de Bruijne, M., & Borgwardt, K. (2013). Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems (pp. 216–224).
Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.
Gao, X., Xiao, B., Tao, D., & Li, X. (2010). A survey of graph edit distance. Pattern Analysis and applications, 13(1), 113–129.
Gärtner, T., Flach, P., & Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines (pp. 129–143). Springer.
Ghaoui, L. E., Viallon, V., & Rabbani, T. (2010). Safe feature elimination for the lasso and sparse supervised learning problems. arXiv:1009.4219.
Goldberg, A. V., & Tarjan, R. E. (1988). A new approach to the maximum-flow problem. Journal of the ACM (JACM), 35(4), 921–940.
Gori, M., Maggini, M., & Sarti, L. (2005). Exact and approximate graph matching using random walks. IEEE transactions on pattern analysis and machine intelligence, 27(7), 1100–1111.
Hsu, C.-W., & Lin, C.-J. (2002). A simple decomposition method for support vector machines. Machine Learning, 46(1), 291–314.
Huang, X., Li, J., & Hu, X. (2017). Label informed attributed network embedding. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (pp. 731–739).
Kang, U., Hebert, M., & Park, S. (2013). Fast and scalable approximate spectral graph matching for correspondence problems. Information Sciences, 220, 306–318.
Kersting, K., Kriege, N. M., Morris, C., Mutzel, P., & Neumann, M. (2016). Benchmark data sets for graph kernels. http://graphkernels.cs.tudortmund.de.
Kondor, R., & Borgwardt, K. M. (2008). The skew spectrum of graphs. In Proceedings of the 25th international conference on Machine learning (pp. 496–503). ACM.
Kondor, R., & Pan, H. (2016). The multiscale laplacian graph kernel. In Advances in Neural Information Processing Systems (pp. 2990–2998).
Kondor, R., Shervashidze, N., & Borgwardt, K. M. (2009). The graphlet spectrum. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 529–536). ACM.
Kriege, N., & Mutzel, P. (2012). Subgraph matching kernels for attributed graphs. arXiv preprint arXiv:1206.6483.
Kuksa, P., Huang, P.-H., & Pavlovic, V. (2008). A fast, large-scale learning method for protein sequence classification. In 8th Int. Workshop on Data Mining in Bioinformatics (pp. 29–37).
Lee, J. B., Rossi, R., & Kong, X. (2018). Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1666–1674). ACM.
Leordeanu, M., Sukthankar, R., & Hebert, M. (2012). Unsupervised learning for graph matching. International journal of computer vision, 96(1), 28–45.
Leslie, C., Eskin, E., & Noble, W. S. (2001). The spectrum kernel: A string kernel for svm protein classification. In Biocomputing 2002 (pp. 564–575). World Scientific.
Leslie, C., & Kuang, R. (2004). Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 5(Nov), 1435–1455.
Leslie, C. S., Eskin, E., Cohen, A., Weston, J., & Noble, W. S. (2004). Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4), 467–476.
Li, D., & Tian, Y. (2018). Survey and experimental study on metric learning methods. Neural Networks, 105, 447–462.
Morris, C., Kriege, N. M., Kersting, K., & Mutzel, P. (2016). Faster kernels for graphs with continuous attributes via hashing. In Data Mining (ICDM), 2016 IEEE 16th International Conference on (pp. 1095–1100). IEEE.
Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., & Grohe, M. (2019). Weisfeiler and Leman go neural: Higher-order graph neural networks. In The Thirty-Third AAAI Conference on Artificial Intelligence (pp. 4602–4609). AAAI Press.
Morvan, M. L., & Vert, J.-P. (2018). WHInter: A working set algorithm for high-dimensional sparse second order interaction models. In Proceedings of the 35th International Conference on Machine Learning (vol. 80, pp. 3635–3644). PMLR.
Nakagawa, K., Suzumura, S., Karasuyama, M., Tsuda, K., & Takeuchi, I. (2016). Safe pattern pruning: An efficient approach for predictive pattern mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1785–1794). ACM.
Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., & Jaiswal, S. (2017). graph2vec: Learning distributed representations of graphs. CoRR, abs/1707.05005.
Neuhaus, M., & Bunke, H. (2007). Automatic learning of cost functions for graph edit distance. Information Sciences, 177(1), 239–247.
Nguyen, P. C., Ohara, K., Mogi, A., Motoda, H., & Washio, T. (2006). Constructing decision trees for graph-structured data by chunkingless graph-based induction. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 390–399). Springer.
Niepert, M., Ahmed, M., & Kutzkov, K. (2016). Learning convolutional neural networks for graphs. In International conference on machine learning (pp. 2014–2023).
Novak, P. K., Lavrač, N., & Webb, G. I. (2009). Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research, 10, 377–403.
Orsini, F., Frasconi, P., & De Raedt, L. (2015). Graph invariant kernels. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (pp. 3756–3762).
Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., & Hsu, M.-C. (2001). PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings 17th International Conference on Data Engineering (pp. 215–224). IEEE.
Pissis, S. P. (2014). MoTeX-II: Structured motif extraction from large-scale datasets. BMC Bioinformatics, 15(1), 235.
Pissis, S. P., Stamatakis, A., & Pavlidis, P. (2013). MoTeX: A word-based HPC tool for motif extraction. In Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics (p. 13). ACM.
Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T., & Tsuda, K. (2009). gBoost: A mathematical programming approach to graph classification and regression. Machine Learning, 75(1), 69–89.
Schreiber, F., & Schwöbbermeyer, H. (2005). Frequency concepts and pattern detection for the analysis of motifs in networks. In Transactions on computational systems biology III (pp. 89–104). Springer.
Shen, C., Kim, J., Liu, F., Wang, L., & Van Den Hengel, A. (2014). Efficient dual approach to distance metric learning. IEEE transactions on neural networks and learning systems, 25(2), 394–406.
Shervashidze, N., & Borgwardt, K. M. (2009). Fast subtree kernels on graphs. In Advances in neural information processing systems (pp. 1660–1668).
Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., & Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep), 2539–2561.
Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., & Borgwardt, K. (2009). Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics (pp. 488–495).
Simonovsky, M., & Komodakis, N. (2017). Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR.
Su, Y., Han, F., Harang, R. E., & Yan, X. (2016). A fast kernel for attributed graphs. In Proceedings of the 2016 SIAM International Conference on Data Mining (pp. 486–494). SIAM.
Sugiyama, M., & Borgwardt, K. (2015). Halting in random walk kernels. In Advances in neural information processing systems (pp. 1639–1647).
Takeuchi, I., & Sugiyama, M. (2011). Target neighbor consistent feature weighting for nearest neighbor classification. In Advances in Neural Information Processing Systems (vol. 24, pp. 576–584). Curran Associates, Inc.
Thoma, M., Cheng, H., Gretton, A., Han, J., Kriegel, H.-P., Smola, A., et al. (2010). Discriminative frequent subgraph mining with optimality guarantees. Statistical Analysis and Data Mining: The ASA Data Science Journal, 3(5), 302–318.
Titouan, V., Courty, N., Tavenard, R., Laetitia, C., & Flamary, R. (2019). Optimal transport for structured data with application on graphs. In Proceedings of the 36th International Conference on Machine Learning (vol. 97, pp. 6275–6284). PMLR.
Tixier, A. J.-P., Nikolentzos, G., Meladianos, P., & Vazirgiannis, M. (2018). Graph classification with 2D convolutional neural networks.
Verma, S., & Zhang, Z.-L. (2017). Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems (pp. 88–98).
Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., & Borgwardt, K. M. (2010). Graph kernels. Journal of Machine Learning Research, 11(Apr), 1201–1242.
Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb), 207–244.
Xie, T., & Grossman, J. C. (2018). Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical Review Letters, 120, 145301.
Yan, J., Yin, X.-C., Lin, W., Deng, C., Zha, H., & Yang, X. (2016). A short survey of recent advances in graph matching. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval (pp. 167–174).
Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on (pp. 721–724). IEEE.
Yanardag, P., & Vishwanathan, S. (2015). Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1365–1374). ACM.
Yoshida, T., Takeuchi, I., & Karasuyama, M. (2018). Safe triplet screening for distance metric learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2653–2662).
Yoshida, T., Takeuchi, I., & Karasuyama, M. (2019a). Learning interpretable metric between graphs: Convex formulation and computation with graph mining. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1026–1036).
Yoshida, T., Takeuchi, I., & Karasuyama, M. (2019b). Safe triplet screening for distance metric learning. Neural Computation, 31(12), 2432–2491.
Zhang, M., Cui, Z., Neumann, M., & Chen, Y. (2018a). An end-to-end deep learning architecture for graph classification. In Proceedings of AAAI Conference on Artificial Intelligence.
Zhang, Y., Liu, Y., Jing, X., & Yan, J. (2007). ACIK: association classifier based on itemset kernel. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (pp. 865–875). Springer.
Zhang, Y., & Zaki, M. J. (2006). Exmotif: efficient structured motif extraction. Algorithms for Molecular Biology, 1(1), 21.
Zhang, Z., Wang, M., Xiang, Y., Huang, Y., & Nehorai, A. (2018b). RetGK: Graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems (pp. 3968–3978).
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Funding
This work was supported by MEXT KAKENHI to I.T. (16H06538, 17H00758) and M.K. (16H06538, 17H04694); by JST CREST awarded to I.T. (JPMJCR1302, JPMJCR1502) and PRESTO awarded to M.K. (JPMJPR15N2); by the Collaborative Research Program of Institute for Chemical Research, Kyoto University to M.K. (grant #201833 and #202131); by the MI2I project of the Support Program for Starting Up Innovation Hub from JST awarded to I.T. and M.K.; and by RIKEN Center for AIP awarded to I.T.
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
Additional information
Editor: Jean-Philippe Vert.
Appendices
A. Dual Problem
The primal problem (4) can be rewritten as
The Lagrange function \({{\mathcal {L}}}\) is
where \({\varvec{\alpha }}\in {\mathbb {R}}^{2nK}\) and \({\varvec{\beta }}\in {\mathbb {R}}_+^p\) are Lagrange multipliers. The dual function \(D_\lambda \) is then
By the definition of the dual function in (33), \({{\mathcal {L}}}\) must be minimized with respect to \({\varvec{m}}\). Setting the partial derivative of \({{\mathcal {L}}}\) with respect to \({\varvec{m}}\) to zero, we obtain
The convex conjugate function of \(\ell _t\) is
which can be written as
From (34), (35), and (36), the dual function can be written as
where
Therefore, although the dual problem can be written as
by maximizing \(D({\varvec{\alpha }},{\varvec{\beta }})\) with respect to \({\varvec{\beta }}\), we obtain a more straightforward dual problem (5).
We obtain \(\alpha _{ij}=\ell '_t(z_{ij})\), used in (7), from the derivative of \({{\mathcal {L}}}\) with respect to \(z_{ij}\).
B. Proof of Lemma 1
From (12), the value of \((x_{i,k'}-x_{j,k'})^2\) is bounded as follows:
Using this inequality, the inner product \({\varvec{C}}_{k',:}{\varvec{q}}\) is likewise bounded:
Similarly, the norm \(\Vert {\varvec{C}}_{k',:}\Vert _2\) is bounded:
Therefore, \({\varvec{C}}_{k',:}{\varvec{q}}+r\Vert {\varvec{C}}_{k',:}\Vert _2 \) is bounded by \(\mathrm {Prune}(k \mid {\varvec{q}}, r)\).
C. Proof of Lemma 2
First, we consider the first term of \(\varvec{C}_{k',:} \varvec{q} + r \Vert \varvec{C}_{k',:}\Vert _2\):
Now, \(x_{i,k'}\in \{0,1\}\) is assumed. Then, if \(x_{i,k'}=0\), we obtain
Meanwhile, if \(x_{i,k'}=1\), we have \(x_{i,k}=1\) from the monotonicity, and subsequently
By using “\(\max \)”, we can unify these two upper bounds into
By a similar argument, the norm of \(\varvec{C}_{k',:}\) can also be bounded by
Thus, we obtain
D. Proof of Theorem 1 (DGB)
From the 1/2-strong convexity of \(D_\lambda ({\varvec{\alpha }})\), for any \({\varvec{\alpha }}\ge \varvec{0}\) and \({\varvec{\alpha }}^\star \ge \varvec{0}\), we obtain
Applying weak duality \(P_\lambda ({\varvec{m}})\ge D_\lambda ({\varvec{\alpha }}^\star )\) and the optimality condition of the dual problem \(\nabla D_\lambda ({\varvec{\alpha }}^\star )^\top ({\varvec{\alpha }}-{\varvec{\alpha }}^\star )\le 0\) to (37), we obtain DGB.
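The way these two facts combine can be spelled out as follows. This is only a sketch: it assumes that the 1/2-strong convexity statement takes the form of the quadratic bound in the first line (with coefficient 1/4, the dual being a maximization problem), and the resulting constant should be checked against the statement of Theorem 1.

\[
D_\lambda ({\varvec{\alpha }}) \le D_\lambda ({\varvec{\alpha }}^\star ) + \nabla D_\lambda ({\varvec{\alpha }}^\star )^\top ({\varvec{\alpha }}-{\varvec{\alpha }}^\star ) - \frac{1}{4}\Vert {\varvec{\alpha }}-{\varvec{\alpha }}^\star \Vert _2^2
\;\Longrightarrow \;
\Vert {\varvec{\alpha }}-{\varvec{\alpha }}^\star \Vert _2^2 \le 4\bigl (D_\lambda ({\varvec{\alpha }}^\star ) - D_\lambda ({\varvec{\alpha }})\bigr ) \le 4\bigl (P_\lambda ({\varvec{m}}) - D_\lambda ({\varvec{\alpha }})\bigr ).
\]

The first inequality on the right drops the gradient term using the optimality condition, and the second replaces the unknown optimal dual value with any primal value using weak duality.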
E. Proof of Theorem 2 (RPB)
From the optimality condition of the dual problem (5),
Here, the gradient vector for the optimal solution is
Thus, by substituting this equation into (38) and (39), we get
From \(\lambda _1 \times \) (40)\(+\lambda _0 \times \) (41),
From (34),
By substituting equation (43) into equation (42), we get
Transforming this inequality by completing the square with the complementary conditions \({{\varvec{m}}_i^\star }^\top {\varvec{\beta }}_i^\star =0\) and \({{\varvec{m}}_1^\star }^\top {\varvec{\beta }}_0^\star ,{{\varvec{m}}_0^\star }^\top {\varvec{\beta }}_1^\star \ge 0\), we obtain
Applying \(\Vert {\varvec{m}}_0^\star - {\varvec{m}}_1^\star \Vert _2^2\ge 0\) to this inequality, we obtain RPB.
F. Proof of Theorem 3 (RRPB)
Considering a hypersphere that expands the RPB radius by \(\frac{\lambda _0+\lambda _1}{2\lambda _0}\epsilon \) and replaces the RPB center with \(\frac{\lambda _0+\lambda _1}{2\lambda _0}\varvec{\alpha }_0\), we obtain
Because \(\epsilon \) is defined by \(\Vert \varvec{\alpha }_0^\star -\varvec{\alpha }_0\Vert _2\le \epsilon \), this sphere covers any RPB made by \(\varvec{\alpha }_0^\star \) which satisfies \(\Vert \varvec{\alpha }_0^\star -\varvec{\alpha }_0\Vert _2\le \epsilon \). Using the reverse triangle inequality
the following is obtained.
By rearranging this, RRPB is obtained.
G. Proof for Theorems 6 (RSS), 7 (RSP) and 8 (RSP for binary feature)
Here, we address only Theorems 6 and 7 because Theorem 8 can be derived in almost the same way as Theorem 7. When \(\lambda _1 = \lambda \) is set in RRPB, the center and the radius of the bound \({{\mathcal {B}}}= \{ \varvec{\alpha }\mid \Vert \varvec{\alpha } - \varvec{q} \Vert _2^2 \le r^2 \}\) are \(\varvec{q} = \frac{\lambda _0+\lambda }{2\lambda _0}{\varvec{\alpha }}_0\) and \(r = \left\| \frac{\lambda _0-\lambda }{2\lambda _0}{\varvec{\alpha }}_0\right\| _2+\Bigl (\frac{\lambda _0+\lambda }{2\lambda _0}+\frac{|\lambda _0-\lambda |}{2\lambda _0}\Bigr )\epsilon \), respectively. Substituting these \(\varvec{q}\) and \(r\) into (16) and (17), respectively, and rearranging them, we obtain the range of \(\lambda \) in which the screening and pruning conditions hold.
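As a numerical illustration of this center and radius, the following minimal sketch evaluates \(\varvec{q}\) and \(r\) for a given \(\lambda \). The function name `rrpb_sphere` and the toy inputs are hypothetical, and the sketch assumes that the second coefficient of \(\epsilon \) is \(|\lambda _0-\lambda |/(2\lambda _0)\), as the reverse triangle inequality in Appendix F suggests.

```python
import math


def rrpb_sphere(alpha0, lam0, lam, eps):
    """Return the center q and radius r of the RRPB sphere (illustrative helper).

    alpha0 : approximate dual solution obtained at lam0 (list of floats)
    lam0   : regularization parameter at which alpha0 was computed (> 0)
    lam    : target regularization parameter
    eps    : assumed bound on ||alpha0_star - alpha0||_2
    """
    if lam0 <= 0:
        raise ValueError("lam0 must be positive")
    center_scale = (lam0 + lam) / (2.0 * lam0)
    # Assumption: the absolute value keeps the radius valid for lam > lam0 too.
    radius_scale = abs(lam0 - lam) / (2.0 * lam0)
    norm_alpha0 = math.sqrt(sum(a * a for a in alpha0))
    q = [center_scale * a for a in alpha0]
    r = radius_scale * norm_alpha0 + (center_scale + radius_scale) * eps
    return q, r


# Toy values, chosen only for illustration.
q, r = rrpb_sphere([3.0, 4.0], lam0=2.0, lam=1.0, eps=0.1)
print(q, r)
```

With an exact reference solution (\(\epsilon = 0\)) the sphere reduces to the RPB sphere, and the radius shrinks to zero as \(\lambda \) approaches \(\lambda _0\).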
H. Proof of Theorem 9 (Convergence of WS)
By introducing a new variable \({\varvec{s}}\), the dual problem (5) can be written as
We demonstrate the convergence of the WS method on a more general convex problem as follows:
where \(f({\varvec{x}})\) is a \(\gamma \)-strongly convex function (\(\gamma >0\)). Here, as shown in Algorithm 4, the working set is defined by \(\mathcal {W}_t = \{ j \mid h_j({\varvec{x}}_{t-1}) \ge 0 \}\) at every iteration. Thus, the updated working set includes all the violated constraints and the constraints on the boundary. We show that Algorithm 4 terminates in a finite number of steps \(T\) and returns the optimal solution \({\varvec{x}}_{T}={\varvec{x}}^\star \).
Proof
Because \(f\) is \(\gamma \)-strongly convex by assumption, the following inequality holds:
At step \(t\), the problem can be written using only the constraints that are active at the optimal solution \({\varvec{x}}_t\) as follows:
From the definition of \({{\mathcal {W}}}_t\), the working set \({{\mathcal {W}}}_{t+1}\) must contain all active constraints \(\{ j \in {{\mathcal {W}}}_t \mid h_j(\varvec{x}_t) = 0\}\) at the step t and can contain other constraints that are not included in \(\mathcal {W}_t\). This means that \({\varvec{x}}_{t+1}\) must be in the feasible region of the optimization problem at the step t (46):
Therefore, from the optimality condition of the optimization problem (46),
From the inequalities (45) and (47), we obtain
If \(\varvec{x}_t\) is not optimal, there exists at least one violated constraint \(h_{j'}(\varvec{x}_t) > 0\) for some \(j'\) because otherwise \(\varvec{x}_t\) is optimal. Then, we see \(\varvec{x}_{t+1} \ne \varvec{x}_t\) because \(\varvec{x}_{t+1}\) should satisfy the constraint \(h_{j'}(\varvec{x}_{t+1}) \le 0\). If \({\varvec{x}}_t\ne {\varvec{x}}_{t+1}\), from \(\Vert {\varvec{x}}_{t+1}-{\varvec{x}}_t\Vert ^2>0\),
Thus, the objective function always strictly increases (\(f({\varvec{x}}_t)<f({\varvec{x}}_{t+1})\)). This indicates that the algorithm never encounters the same working set \({{\mathcal {W}}}_t\) at any other iteration \(t' \ne t\). Because the number of constraints is finite, only finitely many distinct working sets exist. For any step \(t\), the optimal value \(f({\varvec{x}}_t)\) with a subset of the original constraints \(\mathcal {W}_t\) must be smaller than or equal to the optimal value \(f({\varvec{x}}^\star )\) of the original problem (44) with all constraints. Therefore, \(f({\varvec{x}}_t)\le f({\varvec{x}}^\star )\) is satisfied, and we obtain \(f({\varvec{x}}_T)=f({\varvec{x}}^\star )\) at some finite step \(T\). \(\square \)
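The argument can be observed on a toy instance. The sketch below is not Algorithm 4 itself: it assumes a hypothetical problem, \(f({\varvec{x}}) = \frac{1}{2}\Vert {\varvec{x}}-{\varvec{c}}\Vert ^2\) with separable constraints \(h_j({\varvec{x}}) = x_j - u_j \le 0\), chosen because each working-set subproblem then has a closed-form solution. All function names are illustrative.

```python
def solve_subproblem(c, u, working_set):
    """Minimize 0.5*||x - c||^2 subject to x_j <= u_j for j in working_set only."""
    return [min(cj, u[j]) if j in working_set else cj for j, cj in enumerate(c)]


def objective(x, c):
    return 0.5 * sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def working_set_method(c, u, max_iter=100):
    """Working-set loop on the toy problem; returns the solution and objective history."""
    working_set = set()  # start with no constraints
    x = solve_subproblem(c, u, working_set)
    history = [objective(x, c)]
    for _ in range(max_iter):
        h = [xj - uj for xj, uj in zip(x, u)]
        if all(hj <= 0 for hj in h):
            return x, history  # no violated constraint: x is optimal
        # W_t = {j | h_j(x_{t-1}) >= 0}: violated and boundary constraints
        working_set = {j for j, hj in enumerate(h) if hj >= 0}
        x = solve_subproblem(c, u, working_set)
        history.append(objective(x, c))
    raise RuntimeError("working-set loop did not terminate")


x_opt, history = working_set_method(c=[2.0, -1.0, 0.5], u=[1.0, 0.0, 0.5])
print(x_opt, history)
```

Each entry of `history` is the optimal value of a relaxed problem, so the sequence increases strictly until all constraints are satisfied, in line with the proof. Because the constraints here do not interact, the loop stops after a single update; problems with coupled constraints generally need more iterations and a numerical solver for the subproblem.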
I. CPU Time for Other Datasets
Table 10 lists the computational times on the BZR, DD, and FRANKENSTEIN datasets. We first note that RSSP was approximately 2–4 times faster in terms of the traversal time compared with SSP. Next, comparing RSSP and WSP, we see that RSSP was faster for Traverse, and WSP was faster for Solve, as we observed in Table 2. Thus, the combination of WS&SP and RSSP was the fastest for all three datasets in total.
J. Approximating Frequency Without Overlap
Let \(F_G(H)\) be the “frequency without overlap”, that is, the number of occurrences of a subgraph H in a given graph G when occurrences that share any vertex or edge are not counted separately. This \(F_G(H)\) is non-increasing with respect to the growth of H, but it is expensive to compute. Even assuming that we know where all the occurrences of H appear in graph G, calculating \(F_G(H)\) is equivalent to the problem of finding the maximum independent set, which is NP-complete (Schreiber and Schwöbbermeyer 2005). In this section, using information obtained in the process of generating the gSpan tree, we approximate the frequency without overlap by its upper bound. This upper bound is also a lower bound of the frequency with overlap.
Figure 11 shows the process of generating the gSpan tree and frequency. In the figure, we consider the frequency of the subgraph H (\(\textcircled {A}\)-\(\textcircled {A}\)-\(\textcircled {B}\)) contained in the graph G. The graph H is obtained as a pattern extension of the graph \(\textcircled {A}\)-\(\textcircled {A}\) (green frame) by \(\textcircled {A}\)-\(\textcircled {B}\) (red frame). gSpan stores the number of these pattern extensions at each traverse node. We define the count given by this extension as \(F_G^{\mathrm{max}}(H)\) (e.g., \(F_G^{\mathrm{max}}(H) = 5\) for \(\textcircled {A}\)-\(\textcircled {A}\)-\(\textcircled {B}\)). Note that \(F_G^{\mathrm{max}}(H)\) is the frequency of H that allows overlap and counts separately those matches that are equivalent except for the indexing of nodes (e.g., \(F^{\mathrm{max}}_G(H)\) for \(\textcircled {A}\)-\(\textcircled {A}\)-\(\textcircled {B}\)-\(\textcircled {A}\)-\(\textcircled {A}\) is two in the figure). Suppose that H currently has \(e (>1)\) edges (for example, \(e = 2\) in \(\textcircled {A}\)-\(\textcircled {A}\)-\(\textcircled {B}\)). We recursively go back up the traversal tree (the tree on the right of Fig. 11) until we reach \(e = 1\), i.e., the starting edge that generates H (in the case of \(\textcircled {A}\)-\(\textcircled {A}\)-\(\textcircled {B}\), the starting edge is \(\textcircled {A}\)-\(\textcircled {A}\)). We use the number of unique matches of this starting edge (the number of green frames), which we define as \(F^{\mathrm{approx}}_G(H)\), as an approximation of \(F_G(H)\). Obviously, \(F^{\mathrm{approx}}_G(H)\) is less than or equal to \(F_G^{\mathrm{max}}(H)\). In the example, the number of green frames must be less than or equal to the number of red frames. Further, because only overlaps on the starting edge (\(e = 1\)) are considered instead of overlaps in the entire H, \(F^{\mathrm{approx}}_G(H)\) is greater than or equal to \(F_G(H)\).
Therefore, overall, we have \(F_G(H) \le F^{\mathrm{approx}}_G(H) \le F_G^{\mathrm{max}}(H)\). Unfortunately, by definition, \(F^{\mathrm{approx}}_G(H)\) gives the same value whenever H has the same starting edge. However, this means that \(F^{\mathrm{approx}}_G(H)\) satisfies the monotonicity constraint for our pruning. Because subgraph counting is a difficult problem and is not the main focus of our study, we employ \(F^{\mathrm{approx}}_G(H)\) as a simple approximation. For our framework, any approximation is applicable provided that it satisfies the monotonicity constraint.
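The three quantities can be compared on a small hand-made example. The sketch below is illustrative only: it does not use gSpan, the embedding list is a hypothetical input (each tuple lists the vertices of G matched by one occurrence of H, with the first two entries being the starting edge), and the frequency without overlap is obtained by brute force, which is feasible only for tiny inputs.

```python
from itertools import combinations


def freq_max(embeddings):
    """Frequency with overlap: every embedding is counted."""
    return len(embeddings)


def freq_approx(embeddings):
    """Number of unique matches of the starting edge (first two matched vertices)."""
    return len({frozenset(e[:2]) for e in embeddings})


def freq_no_overlap(embeddings):
    """Largest set of pairwise vertex-disjoint embeddings (brute force)."""
    for size in range(len(embeddings), 0, -1):
        for subset in combinations(embeddings, size):
            if all(set(a).isdisjoint(b) for a, b in combinations(subset, 2)):
                return size
    return 0


# Hypothetical graph G: vertices 0:A, 1:A, 2:B, 3:B, 4:A with edges
# 0-1, 1-2, 1-3, 0-4, 4-3. The pattern H is the path A-A-B.
embeddings = [(0, 1, 2), (0, 1, 3), (0, 4, 3)]

print(freq_no_overlap(embeddings), freq_approx(embeddings), freq_max(embeddings))
```

In this example all embeddings share vertex 0, so only one of them can be counted without overlap, while two distinct starting edges and three embeddings exist, which matches the ordering \(F_G(H) \le F^{\mathrm{approx}}_G(H) \le F_G^{\mathrm{max}}(H)\).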
Cite this article
Yoshida, T., Takeuchi, I. & Karasuyama, M. Distance metric learning for graph structured data. Mach Learn 110, 1765–1811 (2021). https://doi.org/10.1007/s10994-021-06009-3
Keywords
 Metric learning
 Structured data
 Graph mining
 Convex optimization
 Interpretability