Abstract
Kernel functions are typically viewed as providing an implicit mapping of points into a high-dimensional space, with the ability to gain much of the power of that space without incurring a high cost if the result is linearly separable by a large margin γ. However, the Johnson-Lindenstrauss lemma suggests that in the presence of a large margin, a kernel function can also be viewed as a mapping to a low-dimensional space, one of dimension only \(\tilde{O}(1/\gamma^2)\). In this paper, we explore the question of whether one can efficiently produce such low-dimensional mappings, using only black-box access to a kernel function. That is, given just a program that computes K(x,y) on inputs x,y of our choosing, can we efficiently construct an explicit (small) set of features that effectively capture the power of the implicit high-dimensional space? We answer this question in the affirmative if our method is also allowed black-box access to the underlying data distribution (i.e., unlabeled examples). We also give a lower bound, showing that if we do not have access to the distribution, then this is not possible for an arbitrary black-box kernel function; we leave as an open problem, however, whether this can be done for standard kernel functions such as the polynomial kernel. Our positive result can be viewed as saying that designing a good kernel function is much like designing a good feature space. Given a kernel, by running it in a black-box manner on random unlabeled examples, we can efficiently generate an explicit set of \(\tilde{O}(1/\gamma^2)\) features, such that if the data was linearly separable with margin γ under the kernel, then it is approximately separable in this new feature space.
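As a concrete illustration of the black-box construction described in the abstract, the sketch below draws d = O(1/γ²) random unlabeled examples x_1, …, x_d from the distribution and maps each point x to the explicit feature vector (K(x, x_1), …, K(x, x_d)). This is a minimal sketch only: the RBF kernel, the constant 8/γ², and the function names (`build_feature_map`, `rbf_kernel`) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Stand-in black-box kernel; any program computing K(x, y) would do.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def build_feature_map(kernel, unlabeled, d):
    """Draw d unlabeled examples and return the explicit feature map
    F(x) = (K(x, x_1), ..., K(x, x_d))."""
    idx = np.random.choice(len(unlabeled), size=d, replace=False)
    landmarks = unlabeled[idx]
    def feature_map(x):
        return np.array([kernel(x, z) for z in landmarks])
    return feature_map

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unlabeled = rng.normal(size=(1000, 5))  # stand-in for unlabeled samples from the distribution
    gamma = 0.2                             # assumed margin under the kernel
    d = int(np.ceil(8 / gamma ** 2))        # dimension on the order of 1/gamma^2 (constant is illustrative)
    F = build_feature_map(rbf_kernel, unlabeled, d)
    x = rng.normal(size=5)
    print(F(x).shape)                       # (d,) explicit feature vector for x
```

The point of the sketch is that the feature map touches the kernel only through evaluations K(x, x_i) on sampled unlabeled points, matching the black-box access model discussed in the abstract.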
References
Achlioptas, D. (2003). Database-friendly random projections. Journal of Computer and System Sciences, 66(4), 671–687.
Arriaga, R. I., & Vempala, S. (1999). An algorithmic theory of learning, robust concepts and random projection. Proceedings of the 40th foundations of computer science (pp. 616–623). Journal version to appear in Machine Learning.
Bartlett, P., & Shawe-Taylor, J. (1999). Generalization performance of support vector machines and other pattern classifiers. Advances in kernel methods: Support vector learning (pp. 43–54). MIT Press.
Ben-David, S., Eiron, N., & Simon, H. U. (2003). Limitations of learning via embeddings in Euclidean half-spaces. Journal of Machine Learning Research, 3, 441–461.
Ben-David, S. (2001). A priori generalization bounds for kernel based learning. NIPS Workshop on kernel based learning.
Ben-Israel, A., & Greville, T. N. E. (1974). Generalized inverses: Theory and applications. New York: Wiley.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. [http://www.ics.uci.edu/mlearn/MLRepository.html]
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the fifth annual workshop on computational learning theory (pp. 144–152).
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Dasgupta, S., & Gupta, A. (1999). An elementary proof of the Johnson-Lindenstrauss lemma. Tech Report, UC Berkeley.
Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.
Gunn, S. R. (1997). Support vector machines for classification and regression. Technical Report, Image Speech and Intelligent Systems Research Group, University of Southampton.
Goldreich, O., Goldwasser, S., & Micali, S. (1986). How to construct random functions. Journal of the ACM, 33(4), 792–807.
Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. Proceedings of the 30th annual ACM symposium on theory of computing (pp. 604–613).
Herbrich, R. (2002). Learning kernel classifiers. Cambridge: MIT Press.
Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 285–318.
Muller, K. R., Mika, S., Ratsch, G., Tsuda, K., & Scholkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201.
Nevo, Z., & El-Yaniv, R. (2003). On online learning of decision lists. The Journal of Machine Learning Research, 3, 271–301.
Scholkopf, B., Burges, C. J. C., & Mika, S. (1999). Advances in kernel methods: Support vector learning. MIT Press.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.
Scholkopf, B., Tsuda, K., & Vert, J.-P. (2004). Kernel methods in computational biology. MIT Press.
Smola, A. J., Bartlett, P., Scholkopf, B., & Schuurmans, D. (Eds.) (2000). Advances in large margin classifiers. MIT Press.
Scholkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
Vapnik, V. N. (1998). Statistical learning theory. New York: John Wiley and Sons Inc.
Additional information
Editor: Philip Long
A preliminary version of this paper appeared in Proceedings of the 15th International Conference on Algorithmic Learning Theory. Springer LNAI 3244, pp. 194–205, 2004.
About this article
Cite this article
Balcan, MF., Blum, A. & Vempala, S. Kernels as features: On kernels, margins, and low-dimensional mappings. Mach Learn 65, 79–94 (2006). https://doi.org/10.1007/s10994-006-7550-1