SpicyMKL: a fast algorithm for Multiple Kernel Learning with thousands of kernels

Abstract

We propose a new optimization algorithm for Multiple Kernel Learning (MKL), called SpicyMKL, which is applicable to general convex loss functions and general types of regularization. SpicyMKL iteratively solves smooth minimization problems, so there is no need to solve SVM, LP, or QP problems internally. SpicyMKL can be viewed as a proximal minimization method and converges super-linearly. The cost of each inner minimization is roughly proportional to the number of active kernels; therefore, when we aim for a sparse kernel combination, our algorithm scales well as the number of kernels increases. Moreover, we give a general block-norm formulation of MKL that includes non-sparse regularizations, such as elastic-net and p-norm regularizations. Extending SpicyMKL, we propose an efficient optimization method for this general regularization framework. Experimental results show that our algorithm is faster than existing methods, especially when the number of kernels is large (> 1000).
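For readers less familiar with the terminology used above, the following LaTeX sketch writes out the standard block-1-norm MKL objective and a generic proximal-point iteration of the kind the abstract alludes to. The notation here is ours and is not taken verbatim from the paper; it is only meant to make the terms "sparse kernel combination" and "proximal minimization" concrete.

```latex
% Block-1-norm MKL: learn one function f_m per reproducing kernel Hilbert
% space H_m and a bias b, with a sum of RKHS norms as the regularizer.
% This is the standard sparse MKL objective; the paper's general block-norm
% formulation replaces the plain sum of norms with other block regularizers
% such as elastic-net or p-norm penalties.
\min_{f_1 \in \mathcal{H}_1, \dots, f_M \in \mathcal{H}_M,\; b}
  \sum_{i=1}^{N} \ell\!\Big( y_i,\ \sum_{m=1}^{M} f_m(x_i) + b \Big)
  + C \sum_{m=1}^{M} \lVert f_m \rVert_{\mathcal{H}_m}

% Generic proximal-point (proximal minimization) iteration on an objective F.
% The abstract states that SpicyMKL can be viewed as a method of this type,
% with each outer step reduced to a smooth inner minimization problem.
x^{(t+1)} = \operatorname*{argmin}_{x}
  \Big\{ F(x) + \tfrac{1}{2\gamma_t}\, \lVert x - x^{(t)} \rVert^2 \Big\},
  \qquad \gamma_t > 0 .
```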


Author information

Corresponding author

Correspondence to Taiji Suzuki.

Additional information

Editors: Süreyya Özöğür-Akyüz, Devrim Ünay, Alex Smola.


About this article

Cite this article

Suzuki, T., Tomioka, R. SpicyMKL: a fast algorithm for Multiple Kernel Learning with thousands of kernels. Mach Learn 85, 77–108 (2011). https://doi.org/10.1007/s10994-011-5252-9

Keywords

  • Multiple kernel learning
  • Sparsity
  • Non-smooth optimization
  • Super-linear convergence
  • Proximal minimization