Machine Learning, Volume 65, Issue 1, pp 79–94

Kernels as features: On kernels, margins, and low-dimensional mappings

  • Maria-Florina Balcan
  • Avrim Blum
  • Santosh Vempala


Kernel functions are typically viewed as providing an implicit mapping of points into a high-dimensional space, with the ability to gain much of the power of that space without incurring a high cost if the result is linearly separable by a large margin γ. However, the Johnson-Lindenstrauss lemma suggests that in the presence of a large margin, a kernel function can also be viewed as a mapping to a low-dimensional space, one of dimension only \(\tilde{O}(1/\gamma^2)\). In this paper, we explore the question of whether one can efficiently produce such low-dimensional mappings, using only black-box access to a kernel function. That is, given just a program that computes K(x,y) on inputs x,y of our choosing, can we efficiently construct an explicit (small) set of features that effectively capture the power of the implicit high-dimensional space? We answer this question in the affirmative if our method is also allowed black-box access to the underlying data distribution (i.e., unlabeled examples). We also give a lower bound, showing that if we do not have access to the distribution, then this is not possible for an arbitrary black-box kernel function; we leave as an open problem, however, whether this can be done for standard kernel functions such as the polynomial kernel. Our positive result can be viewed as saying that designing a good kernel function is much like designing a good feature space. Given a kernel, by running it in a black-box manner on random unlabeled examples, we can efficiently generate an explicit set of \(\tilde{O}(1/\gamma^2)\) features, such that if the data were linearly separable with margin γ under the kernel, then they are approximately separable in this new feature space.
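As a concrete illustration of the idea in the abstract, the simplest candidate mapping of this kind sends a point x to its vector of kernel values against d randomly drawn unlabeled examples, F(x) = (K(x,x₁), …, K(x,x_d)). The sketch below is our own minimal rendering of that idea, not the paper's exact construction (the paper analyzes which such mappings preserve separability and with what margin); the polynomial kernel and the sample count are illustrative choices.

```python
import numpy as np

def poly_kernel(x, y, degree=2):
    # An example black-box kernel: the polynomial kernel mentioned
    # in the abstract. Only K(x, y) evaluations are used below.
    return (1.0 + np.dot(x, y)) ** degree

def kernel_feature_map(K, samples):
    """Given black-box access to a kernel K and unlabeled samples
    x_1, ..., x_d drawn from the data distribution, return the explicit
    feature mapping F(x) = (K(x, x_1), ..., K(x, x_d))."""
    def F(x):
        return np.array([K(x, xi) for xi in samples])
    return F

# Usage sketch: draw d unlabeled examples (the theory suggests
# d = O~(1/gamma^2) suffices for margin gamma) and build features.
rng = np.random.default_rng(0)
unlabeled = [rng.standard_normal(5) for _ in range(20)]
F = kernel_feature_map(poly_kernel, unlabeled)

x = rng.standard_normal(5)
print(F(x).shape)  # (20,)
```

A linear separator can then be learned directly in this explicit 20-dimensional feature space, using the kernel only as an oracle on (new point, unlabeled sample) pairs.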




  1. Achlioptas, D. (2003). Database-friendly random projections. Journal of Computer and System Sciences, 66(4), 671–687.
  2. Arriaga, R. I., & Vempala, S. (1999). An algorithmic theory of learning: Robust concepts and random projection. Proceedings of the 40th Foundations of Computer Science (pp. 616–623). Journal version to appear in Machine Learning.
  3. Bartlett, P., & Shawe-Taylor, J. (1999). Generalization performance of support vector machines and other pattern classifiers. In Advances in kernel methods: Support vector learning (pp. 43–54). MIT Press.
  4. Ben-David, S., Eiron, N., & Simon, H. U. (2003). Limitations of learning via embeddings in Euclidean half-spaces. Journal of Machine Learning Research, 3, 441–461.
  5. Ben-David, S. (2001). A priori generalization bounds for kernel based learning. NIPS Workshop on Kernel Based Learning.
  6. Ben-Israel, A., & Greville, T. N. E. (1974). Generalized inverses: Theory and applications. New York: Wiley.
  7. Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases.
  8. Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144–152).
  9. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
  10. Dasgupta, S., & Gupta, A. (1999). An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report, UC Berkeley.
  11. Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.
  12. Gunn, S. R. (1997). Support vector machines for classification and regression. Technical Report, Image Speech and Intelligent Systems Research Group, University of Southampton.
  13. Goldreich, O., Goldwasser, S., & Micali, S. (1986). How to construct random functions. Journal of the ACM, 33(4), 792–807.
  14. Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on Theory of Computing (pp. 604–613).
  15. Herbrich, R. (2002). Learning kernel classifiers. Cambridge: MIT Press.
  16. Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
  17. Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 285–318.
  18. Muller, K. R., Mika, S., Ratsch, G., Tsuda, K., & Scholkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201.
  19. Nevo, Z., & El-Yaniv, R. (2003). On online learning of decision lists. Journal of Machine Learning Research, 3, 271–301.
  20. Scholkopf, B., Burges, C. J. C., & Mika, S. (1999). Advances in kernel methods: Support vector learning. MIT Press.
  21. Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
  22. Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.
  23. Scholkopf, B., Tsuda, K., & Vert, J.-P. (2004). Kernel methods in computational biology. MIT Press.
  24. Smola, A. J., Bartlett, P., Scholkopf, B., & Schuurmans, D. (Eds.) (2000). Advances in large margin classifiers. MIT Press.
  25. Scholkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
  26. Vapnik, V. N. (1998). Statistical learning theory. New York: John Wiley and Sons.

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  • Maria-Florina Balcan (1)
  • Avrim Blum (1)
  • Santosh Vempala (2)
  1. Computer Science Department, Carnegie Mellon University, Pittsburgh
  2. Department of Mathematics, MIT, Cambridge
