Abstract
We propose a new optimization algorithm for Multiple Kernel Learning (MKL) called SpicyMKL, which is applicable to general convex loss functions and general types of regularization. SpicyMKL iteratively solves smooth minimization problems, so there is no need to solve an SVM, LP, or QP internally. SpicyMKL can be viewed as a proximal minimization method and converges super-linearly. The cost of the inner minimization is roughly proportional to the number of active kernels. Therefore, when we aim for a sparse kernel combination, our algorithm scales well as the number of kernels increases. Moreover, we give a general block-norm formulation of MKL that includes non-sparse regularizations, such as elastic-net and ℓp-norm regularizations. Extending SpicyMKL, we propose an efficient optimization method for this general regularization framework. Experimental results show that our algorithm is faster than existing methods, especially when the number of kernels is large (>1000).
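To make the link between proximal minimization and sparse kernel combinations concrete, the following is a minimal sketch of the standard proximal operator of a block ℓ1 (group-lasso type) penalty, often called block soft-thresholding. It is not the SpicyMKL implementation; the function names, the threshold value, and the example coefficient blocks are hypothetical and chosen only for illustration. Each block stands in for the coefficients associated with one kernel.

```python
import math


def block_soft_threshold(v, threshold):
    """Proximal operator of threshold * ||v||_2 for a single block.

    Returns the zero vector when ||v||_2 <= threshold; otherwise
    shortens v by `threshold` in Euclidean norm, keeping its direction.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= threshold:
        return [0.0] * len(v)
    scale = 1.0 - threshold / norm
    return [scale * x for x in v]


def prox_block_l1(blocks, threshold):
    """Apply block soft-thresholding independently to every block."""
    return [block_soft_threshold(b, threshold) for b in blocks]


def active_blocks(blocks):
    """Indices of blocks that have at least one nonzero entry."""
    return [m for m, b in enumerate(blocks) if any(x != 0.0 for x in b)]


# Hypothetical example: three blocks, one per kernel (illustrative values).
blocks = [[3.0, 4.0], [0.3, 0.4], [0.0, 2.0]]
shrunk = prox_block_l1(blocks, 1.0)
active = active_blocks(shrunk)
```

In this sketch the second block has norm below the threshold and is set exactly to zero, so only blocks 0 and 2 remain active. Any subsequent computation that loops over `active` rather than over all blocks has a cost that grows with the number of active kernels, which is the kind of scaling behaviour the abstract describes.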
Additional information
Editors: Süreyya Özöğür-Akyüz, Devrim Ünay, Alex Smola.
About this article
Cite this article
Suzuki, T., Tomioka, R. SpicyMKL: a fast algorithm for Multiple Kernel Learning with thousands of kernels. Mach Learn 85, 77–108 (2011). https://doi.org/10.1007/s10994-011-5252-9
DOI: https://doi.org/10.1007/s10994-011-5252-9