Abstract
Kernel functions have become an extremely popular tool in machine learning, and they come with an attractive theory as well. This theory views a kernel as implicitly mapping data points into a possibly very high-dimensional space, and describes a kernel function as being good for a given learning problem if the data are separable by a large margin in that implicit space. However, while quite elegant, this theory does not necessarily correspond to the intuition of a good kernel as a good measure of similarity, and the underlying margin in the implicit space is usually not apparent in “natural” representations of the data. It may therefore be difficult for a domain expert to use the theory to help design an appropriate kernel for the learning task at hand. Moreover, the requirement of positive semi-definiteness may rule out the most natural pairwise similarity functions for the given problem domain.
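To make the role of positive semi-definiteness concrete, the following minimal sketch (hypothetical data and numpy code, not from the paper) contrasts a Gram matrix produced by a genuine kernel, which is always positive semi-definite, with a natural symmetric similarity measure, negative Euclidean distance, which is not:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))          # five hypothetical points in R^3

    # Pairwise squared Euclidean distances.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

    # Gaussian RBF kernel: corresponds to an implicit feature map, so its
    # Gram matrix is guaranteed positive semi-definite.
    K = np.exp(-sq_dists)
    print(np.linalg.eigvalsh(K).min())   # >= 0 (up to rounding error)

    # "Similarity = negative distance": symmetric and arguably natural, but
    # its matrix has zero trace with nonzero off-diagonal entries, so it
    # must have a negative eigenvalue and cannot arise as inner products
    # under any implicit feature map.
    S = -np.sqrt(sq_dists)
    print(np.linalg.eigvalsh(S).min())   # strictly negative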
In this work we develop an alternative, more general theory of learning with similarity functions (i.e., sufficient conditions for a similarity function to allow one to learn well) that does not require reference to implicit spaces, and does not require the function to be positive semi-definite (or even symmetric). Instead, our theory talks in terms of more direct properties of how the function behaves as a similarity measure. Our results also generalize the standard theory in the sense that any good kernel function under the usual definition can be shown to also be a good similarity function under our definition (though with some loss in the parameters). In this way, we provide the first steps towards a theory of kernels and more general similarity functions that describes the effectiveness of a given function in terms of natural similarity-based properties.
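One simple algorithmic instantiation of this viewpoint, in the spirit of the approach developed in the paper, is to treat similarities to a sampled set of “landmark” examples as explicit features and then train a linear separator on that representation. The sketch below uses hypothetical data and an arbitrary stand-in similarity function K, which, as the theory permits, is not required to be positive semi-definite:

    import numpy as np

    rng = np.random.default_rng(1)

    def K(x, z):
        # Any bounded pairwise similarity can be plugged in here; it need
        # not be a valid kernel. This choice is purely for illustration.
        return np.tanh(x @ z)

    # Hypothetical binary-labeled data.
    X = rng.normal(size=(200, 10))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1])

    # Map each x to phi(x) = (K(x, x_1), ..., K(x, x_d)) for d sampled
    # landmark examples.
    d = 20
    landmarks = X[rng.choice(len(X), size=d, replace=False)]
    Phi = np.array([[K(x, z) for z in landmarks] for x in X])

    # Learn a linear separator over phi(x) with an averaged perceptron.
    w, w_sum = np.zeros(d), np.zeros(d)
    for _ in range(50):
        for phi, label in zip(Phi, y):
            if label * (phi @ w) <= 0:
                w = w + label * phi
            w_sum = w_sum + w
    w_avg = w_sum / (50 * len(X))
    print("training accuracy:", (np.sign(Phi @ w_avg) == y).mean())

Under the goodness conditions studied in the paper, a similarity-based feature map of this kind yields, with high probability, a representation in which the data are separable by a large margin, so standard linear-separator algorithms apply.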
References
Anthony, M., & Bartlett, P. (1999). Neural network learning: theoretical foundations. Cambridge: Cambridge University Press.
Arora, S., Babai, L., Stern, J., & Sweedyk, Z. (1997). The hardness of approximate optima in lattices, codes, and systems of linear equations. Journal of Computer and System Sciences, 54, 317–331.
Balcan, M.-F., & Blum, A. (2006). On a theory of learning with similarity functions. In Proceedings of the 23rd international conference on machine learning.
Balcan, M.-F., Blum, A., & Vempala, S. (2006). Kernels as features: on kernels, margins, and low-dimensional mappings. Machine Learning, 65(1), 79–94.
Balcan, M.-F., Blum, A., & Vempala, S. (2008). A discriminative framework for clustering via similarity functions. In Proceedings of the 40th ACM symposium on theory of computing.
Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296.
Herbrich, R. (2002). Learning kernel classifiers. Cambridge: MIT Press.
Hettich, R., & Kortanek, K. O. (1993). Semi-infinite programming: theory, methods, and applications. SIAM Review, 35(3), 380–429.
Jaakkola, T. S., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In Advances in neural information processing systems (Vol. 11). Cambridge: MIT Press.
Joachims, T. (2002). Learning to classify text using support vector machines: methods, theory, and algorithms. Dordrecht: Kluwer.
Kalai, A., Klivans, A., Mansour, Y., & Servedio, R. (2005). Agnostically learning halfspaces. In Proceedings of the 46th annual symposium on the foundations of computer science.
Kearns, M., & Vazirani, U. (1994). An introduction to computational learning theory. Cambridge: MIT Press.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P. L., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Liao, L., & Noble, W. S. (2003). Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Journal of Computational Biology, 10(6), 857–868.
McAllester, D. (2003). Simplified PAC-Bayesian margin bounds. In Proceedings of the 16th conference on computational learning theory.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
Schölkopf, B., Tsuda, K., & Vert, J.-P. (2004). Kernel methods in computational biology. Cambridge: MIT Press.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Srebro, N. (2007). How good is a kernel as a similarity function. In Proceedings of the 20th annual conference on learning theory.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Vishwanathan, S. V. N., & Smola, A. J. (2003). Fast kernels for string and tree matching. In Advances in neural information processing systems (Vol. 15). Cambridge: MIT Press.
Wang, L., Yang, C., & Feng, J. (2007). On learning with dissimilarity functions. In Proceedings of the 24th international conference on machine learning (pp. 991–998).
Additional information
Editors: Claudio Gentile, Nader H. Bshouty.
This paper combines results appearing in two conference papers: M.F. Balcan and A. Blum, “On a Theory of Learning with Similarity Functions,” Proc. 23rd International Conference on Machine Learning, 2006, and N. Srebro, “How Good is a Kernel as a Similarity Function,” Proc. 20th Annual Conference on Learning Theory, 2007.
Cite this article
Balcan, M.-F., Blum, A., & Srebro, N. (2008). A theory of learning with similarity functions. Machine Learning, 72, 89–112. https://doi.org/10.1007/s10994-008-5059-5