Machine Learning, Volume 85, Issue 1–2, pp 77–108

SpicyMKL: a fast algorithm for Multiple Kernel Learning with thousands of kernels

Abstract

We propose a new optimization algorithm for Multiple Kernel Learning (MKL), called SpicyMKL, which is applicable to general convex loss functions and general types of regularization. SpicyMKL iteratively solves smooth minimization problems, so there is no need to solve SVM, LP, or QP subproblems internally. The algorithm can be viewed as a proximal minimization method and converges super-linearly. The cost of each inner minimization is roughly proportional to the number of active kernels; therefore, when a sparse kernel combination is sought, the algorithm scales well as the number of kernels increases. Moreover, we give a general block-norm formulation of MKL that includes non-sparse regularizations, such as elastic-net and p-norm regularization. By extending SpicyMKL, we propose an efficient optimization method for this general regularization framework. Experimental results show that our algorithm is faster than existing methods, especially when the number of kernels is large (>1000).
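For concreteness, the block-norm formulations referred to above can be sketched as follows; the notation (n samples, M kernels with reproducing kernel Hilbert spaces \(\mathcal{H}_m\), convex loss \(\ell\), regularization constants \(C\), \(\lambda_1\), \(\lambda_2\)) is ours and is not necessarily the paper's exact notation. The sparse block-1-norm MKL problem is

\[
\min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M}
\;\sum_{i=1}^{n} \ell\!\Bigl(y_i,\; \sum_{m=1}^{M} f_m(x_i)\Bigr)
\;+\; C \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m},
\]

while the non-sparse variants replace the regularizer by an elastic-net penalty

\[
\sum_{m=1}^{M} \Bigl( \lambda_1 \|f_m\|_{\mathcal{H}_m} + \tfrac{\lambda_2}{2} \|f_m\|_{\mathcal{H}_m}^{2} \Bigr)
\]

or a block p-norm penalty \( C \bigl( \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^{p} \bigr)^{1/p} \) with \( p \ge 1 \). A proximal minimization method of the kind described in the abstract handles the non-smooth regularizer by adding a proximity term around the current iterate and solving the resulting smooth subproblem, which is why no SVM, LP, or QP solver is needed in the inner loop.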

Keywords

Multiple kernel learning · Sparsity · Non-smooth optimization · Super-linear convergence · Proximal minimization


Copyright information

© The Author(s) 2011

Authors and Affiliations

1. Department of Mathematical Informatics, The University of Tokyo, Bunkyo-ku, Tokyo, Japan