Machine Learning, Volume 91, Issue 1, pp 43–66

Learning with infinitely many features



We propose a principled framework for learning with infinitely many features, a situation that typically arises from continuously parametrized feature extraction methods. Such cases occur, for instance, with Gabor-based features in computer vision or with Fourier features for kernel approximation. We cast the problem as that of finding a finite subset of features that minimizes a regularized empirical risk. After analyzing the optimality conditions of this problem, we propose a simple algorithm in the flavour of column generation. We also show that, using Fourier-based features, it is possible to perform approximate infinite kernel learning. Our experimental results on several datasets show the benefits of the proposed approach in several settings, including texture classification and large-scale kernelized problems (involving about 100 thousand examples).
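To make the phrase "Fourier features for kernel approximations" concrete, the following is a minimal sketch of the classical random Fourier feature construction for a Gaussian kernel. It is not the feature-selection algorithm proposed in the paper: the frequencies here are simply sampled at random for one fixed bandwidth, and the function names, the `gamma` parameterization, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, rng):
    """Map X (n, d) to z(X) such that z(x) . z(y) ~ exp(-gamma * ||x - y||^2).

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    d = X.shape[1]
    # Frequencies follow the Fourier transform of the Gaussian kernel,
    # i.e. a normal distribution with variance 2 * gamma per coordinate.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def gaussian_kernel(X, gamma):
    """Exact Gaussian kernel matrix exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # toy data, assumed for the example
Z = random_fourier_features(X, n_features=5000, gamma=0.5, rng=rng)

K_approx = Z @ Z.T                    # inner products of explicit features
K_exact = gaussian_kernel(X, gamma=0.5)
max_err = np.max(np.abs(K_approx - K_exact))
print("largest absolute entry-wise error:", max_err)
```

Each column of `Z` corresponds to one feature indexed by a continuous parameter (a frequency vector and a phase), which is the kind of infinite feature family the abstract refers to. A linear model trained on `Z` then approximates a kernel machine without forming the full kernel matrix.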


Infinite features, Column generation, Gabor features, Kernels



This work was supported in part by the IST Program of the European Community, under the PASCAL2 Network of Excellence, IST-216886. This publication only reflects the authors' views. This work was also partly supported by grants ASAP ANR-09-EMER-001, ANR JCJC-12 Lemon and ANR Blanc-12 Greta.



Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. LITIS, EA 4108, Université/INSA de Rouen, Saint Etienne du Rouvray, France
  2. Laboratoire Lagrange, UMR CNRS 7293, Observatoire de la Côte d’Azur, Université de Nice Sophia-Antipolis, Nice, France
