Machine Learning, Volume 48, Issue 1–3, pp 165–187

Kernel Matching Pursuit

  • Pascal Vincent
  • Yoshua Bengio

Abstract

Matching Pursuit algorithms learn a function that is a weighted sum of basis functions, by sequentially appending functions to an initially empty basis, to approximate a target function in the least-squares sense. We show how matching pursuit can be extended to use non-squared error loss functions, and how it can be used to build kernel-based solutions to machine learning problems, while keeping control of the sparsity of the solution. We present a version of the algorithm that makes an optimal choice of both the next basis and the weights of all the previously chosen bases. Finally, links to boosting algorithms and RBF training procedures, as well as an extensive experimental comparison with SVMs for classification are given, showing comparable results with typically much sparser models.

Keywords: kernel methods, matching pursuit, sparse approximation, support vector machines, radial basis functions, boosting
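To make the greedy procedure described in the abstract concrete, below is a minimal sketch of basic kernel matching pursuit for squared-error regression. It assumes a dictionary of Gaussian kernel functions centred on the training points; the kernel width `sigma`, the function names, and the fixed-number-of-terms stopping rule are illustrative choices, not the authors' implementation (which, as the abstract notes, can also re-optimize the weights of all previously chosen bases).

```python
import numpy as np

def rbf_kernel(X, z, sigma=1.0):
    # Gaussian (RBF) kernel between every row of X and a single centre z.
    return np.exp(-np.sum((X - z) ** 2, axis=1) / (2.0 * sigma ** 2))

def kernel_matching_pursuit(X, y, n_terms=10, sigma=1.0):
    """Greedy least-squares fit: at each step pick the training point whose
    kernel function best reduces the squared error of the current residue,
    and append it (with its optimal weight) to the expansion."""
    residue = y.astype(float).copy()
    centres, weights = [], []
    for _ in range(n_terms):
        best = None
        for i in range(len(X)):
            g = rbf_kernel(X, X[i], sigma)           # candidate basis function
            w = (g @ residue) / (g @ g)              # optimal weight in the least-squares sense
            gain = (g @ residue) ** 2 / (g @ g)      # reduction in squared error
            if best is None or gain > best[0]:
                best = (gain, i, w, g)
        _, i, w, g = best
        centres.append(X[i])
        weights.append(w)
        residue -= w * g                             # update the residue
    return np.array(centres), np.array(weights)

# Toy usage: regression on a noisy sine.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
    centres, weights = kernel_matching_pursuit(X, y, n_terms=15, sigma=0.5)
    pred = sum(w * rbf_kernel(X, c, 0.5) for w, c in zip(weights, centres))
    print("training MSE:", np.mean((y - pred) ** 2))
```

In this plain version the previously chosen weights stay fixed once selected; the variant mentioned in the abstract instead re-fits all weights each time a new basis function is appended, yielding a better fit for the same number of terms.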


Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Pascal Vincent (1)
  • Yoshua Bengio (1)
  1. Department of IRO, Université de Montréal, Montreal, Canada
