Large-Scale Nonlinear Variable Selection via Kernel Random Features

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11052)


We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions, without being a priori limited to additive models. To the best of our knowledge, this is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typically poor scaling of kernel methods by mapping the inputs into a relatively low-dimensional space of random features. The algorithm discovers the variables relevant for the regression task jointly with learning the prediction model, through learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method on a set of large-scale synthetic and real datasets. Code related to this paper is available at:
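The random-feature construction the abstract alludes to can be illustrated with a minimal sketch of random Fourier features (Rahimi and Recht, NIPS 2007): inner products of the low-dimensional feature vectors approximate a Gaussian kernel, so a kernel machine can be trained at linear cost in the number of samples. This is not the paper's full method (which additionally learns the feature maps to select variables); the function name and parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, D=500, gamma=1.0, rng=rng):
    """Map inputs to D random features approximating the Gaussian
    kernel k(x, y) = exp(-gamma * ||x - y||^2).  (Hypothetical
    helper, not the paper's API.)"""
    n, d = X.shape
    # Frequencies sampled from the kernel's spectral density N(0, 2*gamma*I)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Sanity check: feature inner products approximate the exact kernel matrix
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, D=20000)
K_approx = Z @ Z.T
K_exact = np.exp(-1.0 * ((X[:, None] - X[None, :]) ** 2).sum(-1))
print(np.abs(K_approx - K_exact).max())  # error shrinks as D grows
```

A variable-selection scheme in this spirit would rescale each input dimension before the random projection (e.g. replace `X @ W` with `(X * theta) @ W` for a learned nonnegative weight vector `theta`) and drive irrelevant weights to zero; the exact parametrization used by the paper may differ.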



This work was partially supported by the research projects HSTS (ISNET) and RAWFIE #645220 (H2020). The computations were performed at University of Geneva on the Baobab and Whales clusters. We specifically wish to thank Yann Sagon, the Baobab administrator, for his excellent work and continuous support.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Geneva School of Business Administration, HES-SO, Geneva, Switzerland
  2. University of Geneva, Geneva, Switzerland
