Hyperparameter Learning for Conditional Kernel Mean Embeddings with Rademacher Complexity Bounds

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11052)


Conditional kernel mean embeddings are nonparametric models that encode conditional expectations in a reproducing kernel Hilbert space. While they provide a flexible and powerful framework for probabilistic inference, their performance is highly dependent on the choice of kernel and regularization hyperparameters. Nevertheless, current hyperparameter tuning methods predominantly rely on expensive cross-validation or on heuristics that are not optimized for the inference task. For conditional kernel mean embeddings with categorical targets and arbitrary inputs, we propose a hyperparameter learning framework based on Rademacher complexity bounds, which prevents overfitting by balancing data fit against model complexity. Our approach requires only batch updates, allowing scalable kernel hyperparameter tuning without invoking kernel approximations. Experiments demonstrate that our learning framework outperforms competing methods, and that it can be further extended to incorporate and learn deep neural network weights to improve generalization. (Source code available at:
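As a rough illustration of the kind of model the abstract describes, the sketch below fits an empirical conditional kernel mean embedding for categorical targets, which reduces to regularized least squares on one-hot labels: scores(x) = k(x)^T (K + n*reg*I)^{-1} Y. It is a minimal sketch and not the authors' implementation. The Gaussian kernel, the function names (`gaussian_kernel`, `fit_cme`, `predict_scores`), and the parameters `lengthscale` and `reg` are assumptions made for illustration, and the Rademacher complexity term of the proposed objective is not reproduced here.

```python
import numpy as np


def gaussian_kernel(A, B, lengthscale=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))


def fit_cme(X, y, n_classes, lengthscale=1.0, reg=1e-2):
    """Solve (K + n * reg * I) W = Y, where Y holds one-hot labels.

    `lengthscale` and `reg` are the kernel and regularization
    hyperparameters that a tuning procedure would have to choose.
    """
    n = X.shape[0]
    Y = np.eye(n_classes)[y]  # one-hot encoding, shape (n, n_classes)
    K = gaussian_kernel(X, X, lengthscale)
    return np.linalg.solve(K + n * reg * np.eye(n), Y)


def predict_scores(X_train, W, X_test, lengthscale=1.0):
    """Per-class scores k(x)^T W for each test input x."""
    return gaussian_kernel(X_test, X_train, lengthscale) @ W


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated clusters.
    X_train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
    y_train = np.array([0, 0, 1, 1])
    W = fit_cme(X_train, y_train, n_classes=2)
    X_test = np.array([[0.05, 0.0], [5.0, 5.1]])
    print(predict_scores(X_train, W, X_test).argmax(axis=1))
```

The raw scores are not guaranteed to lie in [0, 1] or to sum to one, so a clipping or normalizing step would be needed before treating them as probabilities. Both `lengthscale` and `reg` are fixed by hand in this sketch; choosing them automatically is the hyperparameter learning problem the abstract addresses.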

Supplementary material

Supplementary material 1 (pdf 454 KB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Sydney, Sydney, Australia
  2. Data61, CSIRO, Sydney, Australia
  3. Australian National University, Canberra, Australia
