A Hilbert Space Embedding for Distributions

  • Alex Smola
  • Arthur Gretton
  • Le Song
  • Bernhard Schölkopf
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4754)


We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. Applications of this technique can be found in two-sample tests, which are used for determining whether two sets of observations arise from the same distribution, covariate shift correction, local learning, measures of independence, and density estimation.
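The embedding the abstract describes maps a distribution P to its kernel mean in the RKHS, and two-sample tests compare the empirical means of two samples via the maximum mean discrepancy (MMD). As a minimal illustration (a sketch only — the function names, the Gaussian RBF kernel choice, and the bandwidth are our own, not prescribed by the paper), the biased empirical estimate of the squared MMD can be computed directly from kernel matrices:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * sq)

def mmd2(x, y, gamma=1.0):
    # Biased empirical estimate of ||mu_P - mu_Q||^2 in the RKHS,
    # from samples x ~ P and y ~ Q: E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')].
    return (rbf_kernel(x, x, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean()
            + rbf_kernel(y, y, gamma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 1))
y_same = rng.normal(0.0, 1.0, size=(200, 1))   # same distribution as x
y_diff = rng.normal(3.0, 1.0, size=(200, 1))   # shifted mean
print(mmd2(x, y_same), mmd2(x, y_diff))
```

A small estimate suggests the samples come from the same distribution; a large one indicates a discrepancy. A practical test would calibrate a rejection threshold, e.g. by a permutation procedure.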







Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Alex Smola (1)
  • Arthur Gretton (2)
  • Le Song (1)
  • Bernhard Schölkopf (2)
  1. National ICT Australia, North Road, Canberra ACT 0200, Australia
  2. MPI for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
