Our intent here is to investigate the power of propagation kernels (pks) for graph classification. Specifically, we ask:
(Q1) How sensitive are propagation kernels with respect to their parameters, and how should propagation kernels be used for graph classification?

(Q2) How sensitive are propagation kernels to missing and noisy information?

(Q3) Are propagation kernels more flexible than state-of-the-art graph kernels?

(Q4) Can propagation kernels be computed faster than state-of-the-art graph kernels while achieving comparable classification performance?
Towards answering these questions, we consider several evaluation scenarios on diverse graph datasets including chemical compounds, semantic image scenes, pixel texture images, and 3d point clouds to illustrate the flexibility of pks.
Datasets
The datasets used for evaluating propagation kernels come from a variety of domains and thus have diverse properties. We distinguish graph databases of labeled graphs from those of attributed graphs, where attributed graphs usually also carry label information on the nodes. We further separate image datasets, for which we use the pixel grid graphs, from general graphs, which have varying node degrees. Table 1 summarizes the properties of all datasets used in our experiments.
Table 1 Dataset statistics and properties
Labeled Graphs
For labeled graphs, we consider the following benchmark datasets from bioinformatics: mutag, nci1, nci109, and d&d. mutag contains 188 sets of mutagenic aromatic and heteroaromatic nitro compounds, and the label refers to their mutagenic effect on the Gram-negative bacterium Salmonella typhimurium (Debnath et al. 1991). nci1 and nci109 are anti-cancer screens, in particular for cell lung cancer and ovarian cancer cell lines, respectively (Wale and Karypis 2006). d&d consists of 1178 protein structures (Dobson and Doig 2003), where the nodes in each graph represent amino acids and two nodes are connected by an edge if they are less than 6 Ångstroms apart. The graph classes are enzymes and non-enzymes.
Partially Labeled Graphs
The two real-world image datasets msrc 9-class and msrc 21-class are state-of-the-art datasets in semantic image processing, originally introduced in Winn et al. (2005). Each image is represented by a conditional Markov random field graph, as illustrated in Fig. 4a, b. The nodes of each graph are derived by oversegmenting the images using the quick shift algorithm, resulting in one graph over the superpixels of each image. Nodes are connected if the superpixels are adjacent, and each node can further be annotated with a semantic label. Imagining an image retrieval system where users provide images with semantic information, it is realistic to assume that this information is only available for parts of the images, as it is easier for a human annotator to label a small number of image regions rather than the full image. As the images in the msrc datasets are fully annotated, we can derive semantic (ground-truth) node labels by taking the mode ground-truth label of all pixels in the corresponding superpixel. Semantic labels are, for example, building, grass, tree, cow, sky, sheep, boat, face, car, bicycle, and a label void to handle objects that do not fall into one of these classes. We removed images consisting of solely one semantic label, leading to a classification task among eight classes for msrc9 and 20 classes for msrc21.
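For concreteness, the following is a minimal sketch of how such a superpixel adjacency graph could be constructed, assuming scikit-image's quickshift implementation; the stand-in image and segmentation parameters are illustrative, not those used for the msrc graphs.

```python
# Hedged sketch: oversegment an rgb image with quick shift and connect
# superpixels whose pixel regions are 4-adjacent (illustrative parameters).
import numpy as np
from skimage import data
from skimage.segmentation import quickshift

image = data.astronaut()                    # any rgb image as a stand-in
segments = quickshift(image, kernel_size=5, max_dist=10)

edges = set()
h, w = segments.shape
for i in range(h):
    for j in range(w):
        s = segments[i, j]
        if i + 1 < h and segments[i + 1, j] != s:   # vertical adjacency
            edges.add((min(s, segments[i + 1, j]), max(s, segments[i + 1, j])))
        if j + 1 < w and segments[i, j + 1] != s:   # horizontal adjacency
            edges.add((min(s, segments[i, j + 1]), max(s, segments[i, j + 1])))

print(len(np.unique(segments)), "superpixel nodes,", len(edges), "edges")
```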
Attributed Graphs
To evaluate the ability of pks to incorporate continuous node attributes, we consider the attributed graphs used in Feragen et al. (2013) and Kriege and Mutzel (2012). Apart from one synthetic dataset (synthetic), the graphs are all chemical compounds (enzymes, proteins, pro-full, bzr, cox2, and dhfr). synthetic comprises 300 graphs, each with 100 nodes endowed with a one-dimensional normally distributed attribute and 196 edges. Each graph class, A and B, has 150 examples: in A, 10 node attributes were flipped randomly, and in B, 5 were flipped randomly. Further, noise drawn from \(\mathcal {N}(0,0.45^2)\) was added to the attributes in B. proteins is a dataset of chemical compounds with two classes (enzyme and non-enzyme) introduced in Dobson and Doig (2003). enzymes is a dataset of protein tertiary structures belonging to 600 enzymes from the brenda database (Schomburg et al. 2004). The graph classes are their ec (enzyme commission) numbers, which are based on the chemical reactions they catalyze. In both datasets, nodes are secondary structure elements (sse), which are connected whenever they are neighbors either in the amino acid sequence or in 3d space. Node attributes contain physical and chemical measurements, including the length of the sse in Ångstrom, its hydrophobicity, its van der Waals volume, its polarity, and its polarizability. For bzr, cox2, and dhfr, originally used in Mahé and Vert (2009), we use the 3d coordinates of the structures as attributes.
Point Cloud Graphs
In addition, we consider the object database db, introduced in Neumann et al. (2013). db is a collection of 41 simulated 3d point clouds of household objects. Each object is represented by a labeled graph where nodes represent points, labels are semantic parts (top, middle, bottom, handle, and usable-area), and the graph structure is given by a k-nearest neighbor (k-nn) graph w.r.t. Euclidean distance of the points in 3d space, cf. Fig. 4c. We further endowed each node with a continuous curvature attribute approximated by its derivative, that is, by the tangent plane orientations of its neighboring nodes. The attribute of node u is given by \(x_{u} = \sum _{v \in \mathcal {N}(u)} (1-|\mathbf {n}_u \cdot \mathbf {n}_v|)\), where \(\mathbf {n}_u\) is the normal of point u and \(\mathcal {N}(u)\) are the neighbors of node u. The classification task here is to predict the category of each object. Examples of the 11 categories are glass, cup, pot, pan, bottle, knife, hammer, and screwdriver.
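As an illustration, the curvature attribute above could be computed from point normals over a k-nn graph roughly as follows; this is a sketch assuming unit normals and SciPy's cKDTree, with k and the toy data as placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

def curvature_attributes(points, normals, k=5):
    """x_u = sum_{v in N(u)} (1 - |n_u . n_v|) over a k-nn graph."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)    # first neighbor is the point itself
    x = np.empty(len(points))
    for u, nbrs in enumerate(idx):
        nbrs = nbrs[1:]                     # drop self
        x[u] = np.sum(1.0 - np.abs(normals[nbrs] @ normals[u]))
    return x

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))                    # toy point cloud
nrm = rng.normal(size=(100, 3))
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)  # unit normals
print(curvature_attributes(pts, nrm)[:5])
```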
Grid Graphs
We consider a classical benchmark dataset for texture classification (brodatz) and a dataset for plant disease classification (plants). All graphs in these datasets are grid graphs derived from pixel images. That is, the nodes are image pixels connected according to circular symmetric neighbor sets \(N_{r,p}\) as exemplified in Eq. (16). Node labels are computed from the rgb color values by quantization.
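To make the grid-graph construction concrete, the following sketch shows one plausible label quantization and the two simplest neighbor sets as pixel offsets; the paper's exact quantization of the rgb values and Eq. (16) are not reproduced here, so treat the details as assumptions.

```python
import numpy as np

def quantize_labels(rgb, n_colors):
    # collapse rgb to intensity, then uniformly quantize into n_colors bins;
    # a stand-in for the quantization used in the paper
    gray = rgb.astype(float).mean(axis=-1)
    return np.minimum((gray * n_colors / 256).astype(int), n_colors - 1)

# circular symmetric neighbor sets as pixel offsets (cf. Eq. (16)):
N_1_4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]            # r=1, p=4
N_1_8 = N_1_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # r=1, p=8
```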
brodatz, introduced in Valkealahti and Oja (1998), covers 32 textures from the Brodatz album with 64 images per class, comprising the following subsets of images: 16 “original” images (o), 16 rotated versions (r), 16 scaled versions (s), and 16 rotated and scaled versions (rs) of the “original” images. Figure 5a, b show example images with their corresponding quantized versions (e) and (f). For parameter learning, we used a random subset of 20 % of the original images and their rotated versions, and for evaluation we use test suites similar to the ones provided with the dataset. All train/test splits are created such that whenever an original image (o) occurs in one split, its modified versions (r, s, rs) are also included in the same split.
The images in plants, introduced in Neumann et al. (2014), are regions showing disease symptoms extracted from a database of 495 rgb images of beet leaves. The dataset has six classes: five disease symptoms, cercospora, ramularia, pseudomonas, rust, and phoma, and one class for extracted regions not showing a disease symptom. Figure 5c, d illustrate two regions and their quantized versions (g) and (h). We follow the experimental protocol in Neumann et al. (2014) and use 10 % of the full data covering a balanced number of classes (296 regions) for parameter learning and the full dataset for evaluation. Note that this dataset is highly imbalanced, with two infrequent classes accounting for only 2 % of the examples and two frequent classes covering 35 % of the examples.
Experimental protocol
We implemented propagation kernels in Matlab, and classification performance on all datasets except db is evaluated by running c-svm classifications using libSVM. For the parameter analysis (Sect. 7.3), the cost parameter c was learned on the full dataset (\(c \in \{10^{-3}, 10^{-1}, 10^{1}, 10^{3}\}\) for normalized kernels and \(c \in \{10^{-3}, 10^{-2}, 10^{-1},10^{0}\}\) for unnormalized kernels); for the sensitivity analysis (Sect. 7.4), it was set to its default value of 1 for all datasets; and for the experimental comparison with existing graph kernels (Sect. 7.5), we learned it via 5-fold cross-validation on the training set for all methods (\(c \in \{10^{-7}, 10^{-5}, \dots , 10^{5},10^{7} \}\) for normalized kernels and \(c \in \{10^{-7}, 10^{-5}, 10^{-3},10^{-1} \}\) for unnormalized kernels). The number of kernel iterations \(t_{\textsc {max}}\) was learned on the training splits (\(t_{\textsc {max}} \in \{0,1,\dots , 10\}\) unless stated otherwise). Reported accuracies are an average of 10 reruns of a stratified 10-fold cross-validation.
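For reference, the evaluation loop can be sketched as below; scikit-learn's SVC with a precomputed kernel stands in for libSVM, and the kernel matrix K is assumed to be given.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_accuracy(K, y, c=1.0, folds=10, seed=0):
    """Stratified k-fold cross-validation accuracy for a precomputed kernel K."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    accs = []
    for train, test in skf.split(K, y):
        clf = SVC(C=c, kernel="precomputed")
        clf.fit(K[np.ix_(train, train)], y[train])            # train-vs-train block
        accs.append(clf.score(K[np.ix_(test, train)], y[test]))
    return np.mean(accs)

# toy demonstration with a linear base kernel on random features
rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 5)), np.repeat([0, 1], 30)
print(cv_accuracy(X @ X.T, y))
```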
For db, we follow the protocol introduced in Neumann et al. (2013). We perform a leave-one-out (loo) cross validation on the 41 objects in db, where the kernel parameter \(t_{\textsc {max}}\) is learned on each training set again via loo. We further enhanced the nodes by a standardized continuous curvature attribute, which was only encoded in the edge weights in previous work (Neumann et al. 2013).
For all pks, the lsh bin-width parameters were set to \(w_l = 10^{-5}\) for labels and to \(w_a = 1\) for the normalized attributes, and as lsh metrics we chose \(\textsc {m}_l = \textsc {tv}\) and \(\textsc {m}_a = \textsc {l1}\) in all experiments. Before we evaluate classification performance and runtimes of the proposed propagation kernels, we analyze their sensitivity towards the choice of kernel parameters and with respect to missing and noisy observations.
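As a rough, hedged sketch of the lsh step, bucketing node label distributions with a random hyperplane and bin width w could look as follows; this is a generic quantized random projection under our assumptions, not our exact implementation.

```python
import numpy as np

def lsh_buckets(P, w, rng):
    # P: (n_nodes, n_labels) rows of node label distributions
    v = rng.normal(size=P.shape[1])          # random projection direction
    b = rng.uniform(0, w)                    # random offset
    return np.floor((P @ v + b) / w).astype(int)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=10)       # 10 toy label distributions
print(lsh_buckets(P, w=1e-5, rng=rng))       # bucket ids feed the count features
```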
Parameter analysis
To analyze parameter sensitivity with respect to the kernel parameters w (lsh bin width) and \(t_{\textsc {max}}\) (number of kernel iterations), we computed average accuracies over 10 randomly generated test sets for all combinations of w and \(t_{\textsc {max}}\), where \(w \in \{10^{-8},10^{-7}, \dots , 10^{-1} \}\) and \(t_{\textsc {max}} \in \{0,1,\dots ,14\}\) on mutag, enzymes, nci1, and db. The propagation kernel computation is as described in Algorithm 3, that is, we used the label information on the nodes and the label diffusion process as propagation scheme. To assess classification performance, we performed a 10-fold cross validation (cv). Further, we repeated each of these experiments with the normalized kernel, where normalization means dividing each kernel value by the square root of the product of the respective diagonal entries. Note that for normalized kernels we test for larger svm cost values. Figure 6 shows heatmaps of the results.
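The normalization referred to here is the standard one for kernel matrices; a minimal sketch:

```python
import numpy as np

def normalize_kernel(K):
    # divide each entry by the square root of the product of the
    # corresponding diagonal entries, so all self-similarities become 1
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```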
In general, we see that pk performance is relatively smooth, especially if \(w < 10^{-3}\) and \(t_{\textsc {max}} > 4\). Specifically, the numbers of iterations leading to the best results lie in the range \(\{4,\dots ,10\}\), meaning that we do not have to use a large number of iterations in the pk computations, which helps keep computation time low. This is especially important for parameter learning. Comparing the heatmaps of the normalized pk to the unnormalized pk leads to the conclusion that normalizing the kernel matrix can actually hurt performance. This seems to be the case for the molecular datasets mutag and nci1. For mutag, Fig. 6a, b, the performance drops from 88.2 to 82.9 %, indicating that for this dataset the size of the graphs, or more specifically the number of nodes carrying each label, is a strong indicator of the graph label. Nevertheless, incorporating the graph structure, i.e., comparing \(t_{\textsc {max}}=0\) to \(t_{\textsc {max}}=10\), can still improve classification performance by 1.5 %. For other prediction scenarios, such as object category prediction on the db dataset, Fig. 6g, h, we actually want to normalize the kernel matrix to make the prediction independent of object scale: a cup scanned from a larger distance, represented by a smaller graph, is still a cup and should be similar to a larger cup scanned from a closer view. So, for our experiments on object category prediction we will use normalized graph kernels, whereas for the chemical compounds we will use unnormalized kernels unless stated otherwise.
Recall that our propagation kernel schemes are randomized algorithms, as there is randomization inherent in the choice of hyperplanes used during the lsh computation. We ran a simple experiment to test the sensitivity of the resulting graph kernels with respect to the hyperplane used. We computed the pk between all graphs in the datasets mutag, enzymes, msrc9, and msrc21 with \(t_{\textsc {max}} = 10\) 100 times, differing only in the random selection of the lsh hyperplanes. To make comparisons easier, we normalized each of these kernel matrices. We then measured the standard deviation of each kernel entry across these repetitions to gain insight into the stability of the pk to changes in the lsh hyperplanes. The median standard deviations were: mutag: \(5.5 \times 10^{-5}\), enzymes: \(1.1 \times 10^{-3}\), msrc9: \(2.2 \times 10^{-4}\), and msrc21: \(1.1 \times 10^{-4}\). The maximum standard deviations over all pairs of graphs were: mutag: \(6.7 \times 10^{-3}\), enzymes: \(1.4 \times 10^{-2}\), msrc9: \(1.4 \times 10^{-2}\), and msrc21: \(1.1 \times 10^{-2}\). Clearly the pk values are not overly sensitive to random variation due to differing random lsh hyperplanes.
In summary, we can answer (Q1) by concluding that pks are not overly sensitive to the random selection of lsh hyperplanes nor to the choice of parameters; we propose to learn \(t_{\textsc {max}} \in \{0,1,\dots ,10\}\) and fix \(w \le 10^{-3}\). Further, we recommend using the normalized version of pks only when invariance to graph size is deemed important for the classification task.
Sensitivity to missing and noisy information
This section analyzes the performance of propagation kernels in the presence of missing and noisy information.
To assess how sensitive propagation kernels are to missing information, we randomly selected x % of the nodes in all graphs of db and removed their labels (labels) or attributes (attr), where \(x \in \{0,10,\ldots ,90, 95,98,99,99.5,100\}\). To study the performance when both label and attribute information is missing, we selected (independently) x % of the nodes to remove their label information and \(x\%\) of the nodes to remove their attribute information (labels & attr). Figure 7 shows the average accuracy of 10 reruns. While the accuracy decreases with more missing information, the performance remains stable when attribute information is missing, suggesting that the label information is more important for the problem of object category prediction. Further, the standard error increases with more missing information, which corresponds to the intuition that less available information results in a higher variance in the predictions.
We also compare the predictive performance of propagation kernels when only some graphs, for instance graphs at prediction time, have missing labels. To this end, we divided the graphs of mutag, enzymes, msrc9, and msrc21 into two groups. For one group (fully labeled) we consider all nodes to be labeled, and for the other group (missing labels) we remove x % of the labels at random, where \(x \in \{10,20,\dots ,90,91,\dots ,99\}\). Figure 8 shows average accuracies over 10 reruns for each dataset. Whereas for mutag we do not observe a significant difference between the two groups, for enzymes the graphs with missing labels could only be predicted with lower accuracy, even when only 20 % of the labels were missing. For both msrc datasets, we observe that we can still predict the graphs with full label information quite accurately; however, the classification accuracy for the graphs with missing information decreases significantly with the amount of missing labels. For all datasets, removing even 99 % of the labels still leads to better classification results than a random predictor. This may indicate that the size of the graphs itself bears some predictive information, confirming the results from Sect. 7.3.
The next experiment analyzes the performance of propagation kernels when label information is encoded as attributes in a one-hot encoding; we also examine how sensitive they are in the presence of label noise. We corrupted the label encoding with an increasing amount of noise. A noisy label distribution vector \(\mathbf {n}_u\) was generated by sampling \(n_{u,i} \sim \mathcal {U}(0, 1)\) and normalizing so that \(\sum _i n_{u,i} = 1\). Given a noise level \(\alpha \), we used the following values encoded as attributes:
$$\begin{aligned} \mathbf {x}_u \leftarrow (1-\alpha )\,\mathbf {x}_u + \alpha \, \mathbf {n}_u. \end{aligned}$$
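A minimal sketch of this corruption, assuming the one-hot label encodings are stored row-wise in a matrix X:

```python
import numpy as np

def corrupt_labels(X, alpha, rng):
    # X: (n_nodes, n_labels) one-hot (or distribution) rows summing to 1
    N = rng.uniform(size=X.shape)
    N /= N.sum(axis=1, keepdims=True)    # normalize rows to distributions
    return (1 - alpha) * X + alpha * N   # x_u <- (1-alpha) x_u + alpha n_u
```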
Figure 9 shows average accuracies over 10 reruns for msrc9, msrc21, mutag, and enzymes. First, we see that using the attribute encoding of the label information in a p2k variant only propagating attributes achieves performance similar to propagating the labels directly in pk. This confirms that the Gaussian mixture approximation of the attribute distributions is a reasonable choice. Moreover, we observe that the performance on msrc9 and mutag is stable across the tested noise levels. For msrc21 the performance drops for noise levels larger than 0.3. The same happens for enzymes, although there adding a small amount of noise (\(\alpha =0.1\)) actually increases performance. This could be due to a regularization effect caused by the noise and should be investigated in future work.
Finally, we performed an experiment to test the sensitivity of pks with respect to noise in edge weights. For this experiment, we used the datasets bzr, cox2, and dhfr, and defined edge weights between connected nodes according to the distance between the corresponding structure elements in 3d space. Namely, the edge weight (before row normalization) was taken to be the inverse Euclidean distance between the incident nodes. Given a noise level \(\sigma \), we corrupted each edge weight by multiplying by random log-normally distributed noise:
$$\begin{aligned} w_{ij} \leftarrow \exp (\log (w_{ij}) + \varepsilon ), \end{aligned}$$
where \(\varepsilon \sim \mathcal {N}(0, \sigma ^2)\). Figure 10 shows the average test accuracy across ten repetitions of 10-fold cross-validation for this experiment. The bzr and cox2 datasets tolerated a large amount of edge-weight noise without a large effect on predictive performance, whereas dhfr was somewhat more sensitive to larger noise levels.
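A sketch of this corruption on a dense weight matrix W, applied only to existing (positive) edge weights:

```python
import numpy as np

def corrupt_weights(W, sigma, rng):
    # multiplicative log-normal noise: w <- exp(log w + eps), eps ~ N(0, sigma^2)
    W = W.copy()
    nz = W > 0                           # corrupt existing edges only
    W[nz] = np.exp(np.log(W[nz]) + rng.normal(0.0, sigma, size=nz.sum()))
    return W
```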
Summing up these experimental results we answer (Q2) by concluding that propagation kernels behave well in the presence of missing and noisy information.
Comparison to existing graph kernels
We compare classification accuracy and runtime of propagation kernels (pk) with the following state-of-the-art graph kernels: the Weisfeiler–Lehman subtree kernel (wl) (Shervashidze et al. 2011), the shortest path kernel (sp) (Borgwardt and Kriegel 2005), the graph hopper kernel (gh) (Feragen et al. 2013), and the common subgraph matching kernel (csm) (Kriege and Mutzel 2012). Table 2 lists all graph kernels and the types of information they are intended for. For all wl computations, we used the fast implementation introduced in Kersting et al. (2014). In sp, gh, and csm, we used a Dirac kernel to compare node labels and a Gaussian kernel \(k_a(u,v) = \exp (-\gamma \Vert x_u - x_v\Vert ^2)\) with \(\gamma = {1}/{D}\) for attribute information, if feasible. For the bigger datasets (enzymes, proteins, synthetic), csm was computed using a Gaussian kernel truncated for inputs with \(\Vert x_u - x_v\Vert > 1\). We made this decision to encourage sparsity in the generated (node) kernel matrices, reducing the size of the induced product graphs and speeding up computation. Note that this is technically not a valid kernel between nodes; nonetheless, the resulting graph kernels were always positive definite. For pk and wl, the number of kernel iterations (\(t_{\textsc {max}}\) or \(h_{\textsc {max}}\)), and for csm the maximum size of subgraphs (k), was learned on the training splits via 10-fold cross validation. For all runtime experiments, all kernels were computed for the largest value of \(t_{\textsc {max}}\), \(h_{\textsc {max}}\), or k, respectively. We used a linear base kernel for all kernels involving count features, and attributes, if present, were standardized. Further, we considered two baselines that do not take the graph structure into account: labels, corresponding to a pk with \(t_{\textsc {max}}=0\), only compares the label proportions in the graphs, and a takes the mean of a Gaussian node kernel over all pairs of nodes in the respective graphs.
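For illustration, the node kernel on attributes and its truncated variant can be sketched as follows, where Xu and Xv are the attribute matrices of two graphs and the truncation threshold of 1 follows the text above:

```python
import numpy as np

def gaussian_node_kernel(Xu, Xv, truncate=False):
    # k_a(u, v) = exp(-||x_u - x_v||^2 / D), i.e., gamma = 1/D
    D = Xu.shape[1]
    sq = ((Xu[:, None, :] - Xv[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / D)
    if truncate:
        K[sq > 1.0] = 0.0    # sparsifies, but not a valid kernel
    return K
```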
Table 2 Graph kernels and their intended use
Graph classification on benchmark data
In this section, we consider graph classification for fully labeled, partially labeled, and attributed graphs.
Fully labeled graphs The experimental results for labeled graphs are shown in Table 3. On mutag, the baseline using label information only (labels) already gives the best performance indicating that for this dataset the actual graph structure is not adding any predictive information. On nci1 and nci109, wl performs best; however, propagation kernels come in second while being computed over one minute faster. Although sp can be computed quickly, it performs significantly worse than pk and wl. This is also the case for gh, whose computation time is significantly higher. In general, the results on labeled graphs show that propagation kernels can be computed faster than state-of-the-art graph kernels but achieve comparable classification performance, thus question (Q4) can be answered affirmatively.
Partially labeled graphs To assess the predictive performance of propagation kernels on partially labeled graphs, we ran the following experiments 10 times. We randomly removed 20–80 % of the node labels in all graphs in msrc9 and msrc21 and computed cross-validation accuracies and standard errors. Because the wl-subtree kernel was not designed for partially labeled graphs, we compare pk to two variants: one where we treat unlabeled nodes as an additional label “u” (wl), and another where we use hard labels derived from running label propagation (lp) until convergence (lp\(+\)wl). For this experiment we did not learn the number of kernel iterations, but selected the best-performing \(t_{\textsc {max}}\) resp. \(h_{\textsc {max}}\).
The results are shown in Table 4. For larger fractions of missing labels, pk clearly outperforms the baseline methods, and, surprisingly, running label propagation until convergence and then computing wl gives slightly worse results than wl. However, label propagation might be beneficial for larger amounts of missing labels. The runtimes of the different methods on msrc21 are shown in Fig. 12 in “Appendix 1”. wl computed via the string-based implementation suggested in Shervashidze et al. (2011) is over 36 times slower than pk. These results again confirm that propagation kernels have attractive scalability properties for large datasets. The lp\(+\)wl approach wastes computation time by running lp to convergence before it can even begin calculating the kernel; the intermediate label distributions obtained during the convergence process are already extremely powerful for classification. These results clearly show that propagation kernels can successfully deal with partially labeled graphs and suggest an affirmative answer to questions (Q3) and (Q4).
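The contrast drawn here can be made explicit in code: pks hash every intermediate distribution of the diffusion, whereas lp\(+\)wl discards them and only uses the converged hard labels. A hedged sketch of the shared diffusion step:

```python
import numpy as np

def diffusion_distributions(T, P0, t_max):
    # T: row-normalized transition matrix (n x n)
    # P0: initial node label distributions (n x k)
    history, P = [P0], P0
    for _ in range(t_max):
        P = T @ P                # one label diffusion step
        history.append(P)
    # pk hashes every element of history; lp+wl would instead iterate to
    # convergence and keep only the argmax of the final distributions
    return history
```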
Table 4 Partially labeled graphs
Attributed graphs The experimental results for various datasets with attributed graphs are illustrated in Fig. 11. The plots show runtime versus average accuracy, where the error bars reflect the standard deviation of the accuracies. As we are interested in good predictive performance combined with fast kernel computation, methods in the upper-left corners provide the best performance with respect to both quality and speed. For pk, sp, gh, and csm we compare three variants: one using the labels only, one using the attribute information only, and one using both labels and attributes. wl is computed with label information only. For synthetic, cf. Fig. 11a, we used the node degree as label information. Further, we compare the performance of p2k, which propagates labels and attributes as described in Sect. 6.3. Detailed results on synthetic and all bioinformatics datasets are provided in Table 8 (average accuracies) and Table 7 (runtimes) in “Appendix 2”. From Fig. 11 we clearly see that propagation kernels tend to appear in the upper-left corner, that is, they achieve good predictive performance while being fast, leading to a positive answer to question (Q4). Note that the runtimes are shown on a log scale. We can also see that p2k, propagating both labels and attributes (blue star), usually outperforms the simple pk implementation not considering attribute arrangements (blue diamond). However, this comes at the cost of being slower. So, we can use the flexibility of propagation kernels to trade predictive quality against speed, or vice versa, according to the requirements of the application at hand. This supports a positive answer to question (Q3).
Graph classification on novel applications
The flexibility of propagation kernels, arising from easily interchangeable propagation schemes, and their efficient computation via lsh allow us to apply graph kernels to novel domains. First, we are able to compare larger graphs within reasonable time, opening up the use of graph kernels for object category prediction of 3d point clouds in the context of robotic grasping (Neumann et al. 2013). Depending on their size and the perception distance, point clouds of household objects can easily consist of several thousand nodes. Traditional graph kernels suffer from enormous computation times or memory problems even on datasets like db, which can still be regarded as medium sized. These issues are aggravated further when considering image data. So far, graph kernels have been used for image classification on the scene level, where the nodes comprise segments of similar pixels and one image is then represented by fewer than 100 so-called superpixels. Utilizing off-the-shelf techniques for efficient diffusion on grid graphs allows the use of propagation kernels to analyze images on the pixel level and thus opens up a whole area of interesting problems at the intersection of graph-based machine learning and computer vision. As a first step, we apply graph kernels, more precisely propagation kernels, to texture classification, where we consider datasets with thousands of graphs containing a total of several million nodes.
3d object category prediction In this set of experiments, we follow the protocol introduced in Neumann et al. (2013), where the graph kernel values are used to derive a prior distribution on the object category for a given query object. The experimental results for the 3d-object classification are summarized in Table 5. We observe that propagation kernels easily deal with the point-cloud graphs. Of the baseline graph kernels considered, only wl was feasible to compute, and it performed poorly. Propagation kernels clearly benefit from their flexibility, as we can improve the classification accuracy from 75.4 to 80.7 % by considering the object curvature attribute. These results are extremely promising given that we tackle a classification problem with 11 classes having only 40 training examples for each query object.
Table 5 Point cloud graphs
Grid graphs For brodatz and plants we follow the experimental protocol in Neumann et al. (2014). The pk parameter \(t_{\textsc {max}}\) was learned on a training subset of the full dataset (\(t_{\textsc {max}} \in \{0,3,5,8,10,15,20\}\)). For plants this training dataset consists of 10 % of the full data; for brodatz we used 20 % of the brodatz-o-r data, as the number of classes in this dataset is much larger (32 textures). We also learned the quantization values (\( col \in \{3,5,8,10,15\}\)) and neighborhoods (\(B \in \{N_{1,4}, N_{1,8}, N_{2,16}\}\), cf. Eq. (16)). For brodatz the best performance on the training data was achieved with 3 colors and an 8-neighborhood, whereas for plants 5 colors and the 4-neighborhood were learned. We compare pk to the simple baseline labels, using label counts only, and to a powerful second-order statistical feature based on the gray-level co-occurrence matrix (Haralick et al. 1973), comparing intensities (glcm-gray) resp. quantized labels (glcm-quant) of neighboring pixels. The experimental results for grid graphs are shown in Table 6. While not outperforming sophisticated and complex state-of-the-art computer vision approaches to texture classification, we find that it is feasible to compute pks on huge image datasets, achieving respectable performance out of the box. Compared to the immense tuning of features and methods commonly done in computer vision, this is a great success. On plants, pk achieves an average accuracy of 82.5 %, where the best reported result so far is 83.7 %, which was only achieved after tailoring a complex feature ensemble (Neumann et al. 2014). In conclusion, propagation kernels are an extremely promising approach at the intersection of machine learning, graph mining, and computer vision.
Summarizing all experimental results, the capabilities claimed in Table 2 are supported. Propagation kernels have proven extremely flexible and efficient and thus question (Q3) can ultimately be answered affirmatively.