Kernels as features: On kernels, margins, and low-dimensional mappings
 Maria-Florina Balcan,
 Avrim Blum,
 Santosh Vempala
Abstract
Kernel functions are typically viewed as providing an implicit mapping of points into a high-dimensional space, with the ability to gain much of the power of that space without incurring a high cost if the result is linearly separable by a large margin γ. However, the Johnson-Lindenstrauss lemma suggests that in the presence of a large margin, a kernel function can also be viewed as a mapping to a low-dimensional space, one of dimension only \(\tilde{O}(1/\gamma^2)\). In this paper, we explore the question of whether one can efficiently produce such low-dimensional mappings, using only black-box access to a kernel function. That is, given just a program that computes K(x,y) on inputs x,y of our choosing, can we efficiently construct an explicit (small) set of features that effectively capture the power of the implicit high-dimensional space? We answer this question in the affirmative if our method is also allowed black-box access to the underlying data distribution (i.e., unlabeled examples). We also give a lower bound, showing that if we do not have access to the distribution, then this is not possible for an arbitrary black-box kernel function; we leave as an open problem, however, whether this can be done for standard kernel functions such as the polynomial kernel. Our positive result can be viewed as saying that designing a good kernel function is much like designing a good feature space. Given a kernel, by running it in a black-box manner on random unlabeled examples, we can efficiently generate an explicit set of \(\tilde{O}(1/\gamma^2)\) features, such that if the data was linearly separable with margin γ under the kernel, then it is approximately separable in this new feature space.
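The kind of construction the abstract describes — generating explicit features by running the kernel in a black-box manner on random unlabeled examples — can be sketched as follows. This is an illustrative sketch only: the kernel choice, the sample size d, and all function names here are assumptions for the example, not taken from the paper. Drawing d unlabeled examples z_1, …, z_d from the distribution and mapping x ↦ (K(x, z_1), …, K(x, z_d)) uses nothing but black-box kernel calls plus unlabeled data:

```python
import numpy as np

def poly_kernel(x, y, degree=2):
    """Standard polynomial kernel K(x, y) = (x . y)^degree (illustrative choice)."""
    return float(np.dot(x, y)) ** degree

def build_feature_map(kernel, unlabeled):
    """Return an explicit feature map F built from black-box kernel calls
    on a sample of unlabeled examples (hypothetical helper, not from the paper)."""
    def F(x):
        # One coordinate per unlabeled example: F(x)_i = K(x, z_i).
        return np.array([kernel(x, z) for z in unlabeled])
    return F

rng = np.random.default_rng(0)
unlabeled = rng.normal(size=(50, 5))   # d = 50 unlabeled samples (stand-in distribution)
F = build_feature_map(poly_kernel, unlabeled)
x = rng.normal(size=5)
print(F(x).shape)                      # each point becomes an explicit 50-dim feature vector
```

In the paper's setting, d would be chosen as \(\tilde{O}(1/\gamma^2)\), so that data separable with margin γ under the kernel remains approximately linearly separable in the new feature space; the sketch above only shows the black-box mechanics of producing the features.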
 Achlioptas, D. (2003). Database-friendly random projections. Journal of Computer and System Sciences, 66, 671–687.
 Arriaga, R. I., & Vempala, S. (1999). An algorithmic theory of learning, robust concepts and random projection. Proceedings of the 40th Foundations of Computer Science (pp. 616–623). Journal version to appear in Machine Learning.
 Bartlett, P., & Shawe-Taylor, J. (1999). Generalization performance of support vector machines and other pattern classifiers. Advances in kernel methods: Support vector learning (pp. 43–54). MIT Press.
 Ben-David, S., Eiron, N., & Simon, H. U. (2003). Limitations of learning via embeddings in Euclidean half-spaces. Journal of Machine Learning Research, 3, 441–461.
 Ben-David, S. (2001). A priori generalization bounds for kernel based learning. NIPS Workshop on Kernel Based Learning.
 Ben-Israel, A., & Greville, T. N. E. (1974). Generalized inverses: Theory and applications. Wiley, New York.
 Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html]
 Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144–152).
 Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
 Dasgupta, S., & Gupta, A. (1999). An elementary proof of the Johnson-Lindenstrauss lemma. Technical report, UC Berkeley.
 Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37, 277–296.
 Gunn, S. R. (1997). Support vector machines for classification and regression. Technical report, Image Speech and Intelligent Systems Research Group, University of Southampton.
 Goldreich, O., Goldwasser, S., & Micali, S. (1986). How to construct random functions. Journal of the ACM, 33, 792–807.
 Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on Theory of Computing (pp. 604–613).
 Herbrich, R. (2002). Learning kernel classifiers. MIT Press, Cambridge.
 Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
 Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318.
 Müller, K. R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12, 181–201.
 Nevo, Z., & El-Yaniv, R. (2003). On online learning of decision lists. Journal of Machine Learning Research, 3, 271–301.
 Schölkopf, B., Burges, C. J. C., & Mika, S. (1999). Advances in kernel methods: Support vector learning. MIT Press.
 Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44, 1926–1940.
 Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.
 Schölkopf, B., Tsuda, K., & Vert, J.-P. (2004). Kernel methods in computational biology. MIT Press.
 Smola, A. J., Bartlett, P., Schölkopf, B., & Schuurmans, D. (Eds.) (2000). Advances in large margin classifiers. MIT Press.
 Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge.
 Vapnik, V. N. (1998). Statistical learning theory. New York: John Wiley and Sons.
 http://www.isis.ecs.soton.ac.uk/resources/svminfo/
 Title
 Kernels as features: On kernels, margins, and low-dimensional mappings
 Journal

Machine Learning
Volume 65, Issue 1, pp. 79–94
 Cover Date
 2006-10-01
 DOI
 10.1007/s10994-006-7550-1
 Print ISSN
 0885-6125
 Online ISSN
 1573-0565
 Publisher
 Kluwer Academic Publishers
 Authors

 Maria-Florina Balcan ^{(1)}
 Avrim Blum ^{(1)}
 Santosh Vempala ^{(2)}
 Author Affiliations

 1. Computer Science Department, Carnegie Mellon University, Pittsburgh
 2. Department of Mathematics, MIT, Cambridge