A Performance Evaluation of Mutual Information Estimators for Multivariate Feature Selection

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 204)


Mutual information is one of the most popular criteria used in feature selection, and many techniques have been proposed to estimate it. The large majority of them are based on probability density estimation and perform badly when faced with high-dimensional data, because of the curse of dimensionality. However, being able to robustly evaluate the mutual information between a subset of features and an output vector can be of great interest in feature selection. This is particularly the case when some features are redundant or relevant only jointly. In this paper, different mutual information estimators are compared according to criteria that are important for feature selection, and the value of a nearest neighbors-based estimator is shown.
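The point about features that are relevant only jointly can be illustrated with a small sketch. It is not one of the estimators compared in the paper: it assumes discrete data and uses a simple frequency-count (plug-in) estimate, and the function name `mutual_information` and the XOR toy data are hypothetical choices made for illustration only.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(xs, ys):
    """Plug-in (frequency-count) estimate of I(X; Y) in bits.

    Assumes discrete, hashable samples; this is an illustrative
    estimator, not one of those evaluated in the paper.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    # sum over observed pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Toy data (assumption): two balanced binary features, output is their XOR.
samples = list(product([0, 1], repeat=2)) * 25  # 100 samples
x1 = [a for a, _ in samples]
x2 = [b for _, b in samples]
y = [a ^ b for a, b in samples]

mi_x1 = mutual_information(x1, y)          # feature 1 alone
mi_x2 = mutual_information(x2, y)          # feature 2 alone
mi_joint = mutual_information(samples, y)  # both features together

print(mi_x1, mi_x2, mi_joint)
```

On this toy data each feature taken alone shares no information with the output, while the pair determines it completely. A criterion that scores features one at a time would discard both, which is why estimating the mutual information of a feature subset matters.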


Keywords: Mutual information estimation; Feature selection; Density estimation; Nearest neighbors





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Machine Learning Group, ICTEAM, Université catholique de Louvain, Louvain-la-Neuve, Belgium
