Abstract
Developing effective techniques for learning from structured data is becoming increasingly important. In this context, kernel methods are the state-of-the-art tools, widely adopted in real-world applications that involve learning on structured data. Conversely, on unstructured domains, deep learning methods are a competitive, and often superior, choice. In this paper we propose a new family of graph kernels that exploits an abstract representation of the information, inspired by the multilayer perceptron architecture. Our proposal combines the advantages of both worlds: on one side, it leverages the expressiveness of state-of-the-art graph node kernels; on the other, it builds a multilayer architecture as a series of stacked kernel pre-image estimators, trained in an unsupervised fashion via convex optimization. The hidden layers of the proposed framework are trained in a forward manner, which avoids the greedy layerwise training of classical deep learning. Results on real-world graph datasets confirm the quality of the proposal.
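To make the stacking idea concrete, the following is a minimal Python sketch of one way to chain kernel pre-image estimators forward. It is an illustration under stated assumptions, not the paper's implementation: the RBF kernel, the toy data, and the names rbf_kernel and preimage_layer are stand-ins, since the paper builds its first layer from graph node kernels such as LEDK, MDK, and RLK.

import numpy as np

def rbf_kernel(X, gamma):
    # Pairwise squared Euclidean distances, then a Gaussian kernel.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def preimage_layer(K, X, lam):
    # Regularized pre-image estimation as kernel ridge regression from the
    # kernel feature space back to the input space: the coefficients A solve
    # the convex problem (K + lam * I) A = X in closed form, and K @ A is
    # the reconstructed representation fed to the next layer.
    n = K.shape[0]
    A = np.linalg.solve(K + lam * np.eye(n), X)
    return K @ A

# Stack layers forward: each layer is fit once, with no backpropagation.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))  # toy features; the paper starts from a graph node kernel
H = X
for lam, gamma in [(1e-2, 0.5), (1e-1, 0.5)]:
    K = rbf_kernel(H, gamma)
    H = preimage_layer(K, H, lam)
# H can now feed a standard convex classifier such as an SVM.

Because each layer reduces to a regularized least-squares problem, every training step stays convex, which is the property the stacking scheme is designed to maintain.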
Appendix: Hyperparameter Selection Results
In this appendix we report the hyperparameters selected during the model selection phase of the experiments presented in Section 5.1. From Table 3 we can draw some observations. First, the C parameter of the SVM is generally high, while the \(\lambda \) parameter of the pre-image estimator (the first layer) shows more variability; this indicates that regularization happens mostly in the first layer. Second, the kernel parameters are fairly stable across datasets (within the same order of magnitude) for the architectures involving the LEDK and RLK kernels. In contrast, architectures involving the MDK kernel show more variability in the selected parameters.
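As a hedged illustration of the selection procedure described above, the following Python sketch grid-searches the first-layer regularizer \(\lambda \), a kernel parameter, and the SVM C via cross-validation. The grids, the RBF kernel, the toy data, and the helper name forward_rep are assumptions made for this example, not the values or kernels used in the experiments.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_rep(X, lam, gamma):
    # One pre-image layer: RBF kernel, then kernel ridge reconstruction.
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    return K @ np.linalg.solve(K + lam * np.eye(len(X)), X)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((60, 8)), rng.integers(0, 2, 60)

best_cfg, best_acc = None, -np.inf
for lam in [1e-4, 1e-2, 1.0]:          # first-layer regularizer
    for gamma in [1e-2, 1e-1, 1.0]:    # kernel parameter
        H = forward_rep(X, lam, gamma)
        for C in [1.0, 1e2, 1e4]:      # SVM C (selected values tend to be high)
            acc = cross_val_score(SVC(C=C, kernel="linear"), H, y, cv=5).mean()
            if acc > best_acc:
                best_cfg, best_acc = (lam, gamma, C), acc
print("selected:", best_cfg, "cv accuracy:", round(best_acc, 3))

In this toy setup the representation is rebuilt once per (\(\lambda \), kernel parameter) pair and only the SVM is refit across values of C, mirroring the fact that the hidden layers are trained forward and independently of the final classifier.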