Feature Selection in Regression Tasks Using Conditional Mutual Information

  • Pedro Latorre Carmona
  • José M. Sotoca
  • Filiberto Pla
  • Frederick K. H. Phoa
  • José Bioucas Dias
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6669)


This paper presents a supervised feature selection method for regression problems. The method uses a dissimilarity matrix originally developed for classification problems, whose applicability is extended here to regression; the matrix is built from the conditional mutual information between features with respect to a continuous relevant variable that represents the regression function. Applying an agglomerative hierarchical clustering technique, the algorithm selects a subset of the original set of features. The proposed technique is compared with three other methods. Experiments on four datasets of different nature show the importance of the selected features in terms of the regression estimation error, measured as the Root Mean Squared Error (RMSE) of Support Vector Regression.


Keywords: Feature Selection · Regression · Information measures · Conditional Density Estimation





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Pedro Latorre Carmona (1)
  • José M. Sotoca (1)
  • Filiberto Pla (1)
  • Frederick K. H. Phoa (2)
  • José Bioucas Dias (3)
  1. Dept. Lenguajes y Sistemas Informáticos, Jaume I University, Spain
  2. Institute of Statistical Science, Academia Sinica, R.O.C.
  3. Instituto de Telecomunicações and Instituto Superior Técnico, Technical University of Lisbon, Portugal
