Some Equivalences between Kernel Methods and Information Theoretic Methods

  • Robert Jenssen
  • Torbjørn Eltoft
  • Deniz Erdogmus
  • Jose C. Principe


In this paper, we discuss some equivalences between two recently introduced statistical learning schemes, namely Mercer kernel methods and information theoretic methods. We show that Parzen window-based estimators for some information theoretic cost functions are also cost functions in a corresponding Mercer kernel space. The Mercer kernel is directly related to the Parzen window. Furthermore, we analyze a classification rule based on an information theoretic criterion, and show that this corresponds to a linear classifier in the kernel space. By introducing a weighted Parzen window density estimator, we also formulate the support vector machine in this information theoretic perspective.
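As an illustration of the first claim, the sketch below checks one such equivalence numerically. It assumes a one-dimensional Gaussian Parzen window of width `sigma` and takes the integral of the squared density estimate (the quantity inside Renyi's quadratic entropy) as the information theoretic cost. For a Gaussian window this integral equals the mean of a Gaussian kernel matrix of width `sigma * sqrt(2)`, and the mean of a Mercer kernel matrix is the squared norm of the mean of the mapped data in the kernel feature space. The function names, sample size, and window width are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian(u, sigma):
    """1-D Gaussian density with standard deviation sigma, evaluated at u."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def parzen(grid, samples, sigma):
    """Parzen window density estimate on `grid` from 1-D `samples`."""
    return gaussian(grid[:, None] - samples[None, :], sigma).mean(axis=1)

def kernel_matrix_mean(samples, sigma):
    """Mean of the Gaussian kernel matrix with width sigma * sqrt(2).

    Because K[i, j] = <phi(x_i), phi(x_j)> for a Mercer kernel, this mean
    equals the squared norm of the feature-space mean vector.
    """
    diffs = samples[:, None] - samples[None, :]
    return gaussian(diffs, np.sqrt(2.0) * sigma).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=50)   # illustrative data, assumed 1-D
sigma = 0.5               # illustrative Parzen window width

# Left side: integrate the squared Parzen estimate with a Riemann sum.
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]
integral = np.sum(parzen(grid, x, sigma) ** 2) * dx

# Right side: the same quantity written as a kernel-space cost.
kernel_mean = kernel_matrix_mean(x, sigma)

print(integral, kernel_mean)
```

The two printed values should agree up to the small error of the numerical integration, which is the sense in which the Parzen window determines the Mercer kernel in this example.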


Mercer kernel methods, information theoretic methods, Parzen window





Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  • Robert Jenssen (1)
  • Torbjørn Eltoft (1)
  • Deniz Erdogmus (2)
  • Jose C. Principe (3)

  1. Department of Physics and Technology, University of Tromsø, Tromsø, Norway
  2. Computer Science and Engineering Department, Oregon Graduate Institute, OHSU, Portland, USA
  3. Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA
