Abstract
Measures of statistical dependence between random variables have been successfully applied in many machine learning tasks, such as independent component analysis, feature selection, clustering and dimensionality reduction. This success rests on the fact that many existing learning tasks can be cast as problems of dependence maximization (or minimization). Motivated by this, we present a unifying view of kernel learning via statistical dependence estimation. The key idea is that a good kernel should maximize the statistical dependence between the kernel-mapped data and the class labels. The dependence is measured by the Hilbert–Schmidt independence criterion (HSIC), which computes the Hilbert–Schmidt norm of the cross-covariance operator between mapped samples in the corresponding Hilbert spaces and is traditionally used to measure the statistical dependence between random variables. As a special case of kernel learning, we propose a Gaussian kernel optimization method for classification that maximizes the HSIC, where two forms of Gaussian kernels (the spherical kernel and the ellipsoidal kernel) are considered. Extensive experiments on real-world data sets from the UCI benchmark repository validate the superiority of the proposed approach in terms of both prediction accuracy and computational efficiency.
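The criterion described in the abstract can be made concrete with the standard biased empirical HSIC estimator of Gretton et al.: given an input kernel matrix K and a label kernel matrix L over n samples, HSIC(K, L) = tr(KHLH) / (n - 1)^2, where H = I - (1/n)11^T is the centering matrix. The sketch below illustrates the kernel-selection idea for the spherical Gaussian kernel; the grid search over widths, the function names, and the delta label kernel L[i, j] = 1 if y_i = y_j (0 otherwise) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of HSIC-based Gaussian kernel selection, assuming a simple
# grid search and a delta label kernel; illustrative only, not the paper's code.
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def gaussian_kernel(X, sigma):
    """Spherical Gaussian kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def select_sigma(X, y, sigmas):
    """Pick the kernel width maximizing HSIC between kernel and class labels."""
    L = (y[:, None] == y[None, :]).astype(float)  # delta label kernel (assumed)
    return max(sigmas, key=lambda s: hsic(gaussian_kernel(X, s), L))
```

For the ellipsoidal kernel, the same objective applies with a per-feature width vector in place of the single sigma; a gradient-based optimizer over those widths would then replace the grid search.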
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China (No. 61562003), the Natural Science Foundation of Jiangxi Province of China (No. 20161BAB202070), and the China Scholarship Council (No. 201508360144). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.