
Mutual information for feature selection: estimation or counting?


In classification, feature selection is an important pre-processing step that simplifies the dataset and improves the quality of the data representation, making classifiers more accurate, easier to train, and easier to interpret. Because of its ability to capture non-linear interactions between features, mutual information has been widely applied to feature selection. Alongside counting approaches, the traditional way to calculate mutual information, many estimation methods have been proposed that allow mutual information to be computed directly on continuous datasets. This work compares the effect of the counting approach and the kernel density estimation (KDE) approach on feature selection, using particle swarm optimisation as the search mechanism. Experimental results on 15 different datasets show that KDE works well on both continuous and discrete datasets. In addition, feature subsets evolved using KDE achieve similar or better classification performance than those evolved using the counting approach. Furthermore, results on artificial datasets with various feature interactions show that KDE correctly captures interactions between features, in terms of both relevance and redundancy, which the counting approach cannot achieve.
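The contrast between the two approaches can be sketched as follows: a counting estimator discretises a continuous feature into bins and tallies joint frequencies, while a KDE estimator places a Gaussian kernel on every sample and averages log-density ratios. The sketch below is a minimal illustration under assumed choices (fixed bandwidth, fixed bin count, synthetic toy data), not the implementation used in the paper:

```python
import numpy as np

def mi_counting(x, y, bins=10):
    """Counting (histogram) MI estimate between continuous x and discrete y."""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    joint = np.zeros((xd.max() + 1, y.max() + 1))
    for xi, yi in zip(xd, y):
        joint[xi, yi] += 1                      # joint frequency counts
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)         # marginal over x bins
    py = pxy.sum(axis=0, keepdims=True)         # marginal over classes
    nz = pxy > 0                                # skip empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mi_kde(x, y, bandwidth=0.3):
    """KDE MI estimate: I(X;Y) ~ mean_i [log p(x_i|y_i) - log p(x_i)]."""
    def kde(points, queries):
        # Gaussian kernel density, fixed bandwidth (an assumption here)
        d = (queries[:, None] - points[None, :]) / bandwidth
        return np.exp(-0.5 * d**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
    p_x = kde(x, x)                             # marginal density at samples
    p_x_given_y = np.empty_like(x)
    for c in np.unique(y):                      # class-conditional densities
        mask = y == c
        p_x_given_y[mask] = kde(x[mask], x[mask])
    return float(np.mean(np.log(p_x_given_y) - np.log(p_x)))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
x = y + rng.normal(0, 0.5, 500)   # class-dependent feature: MI is high
print(f"counting={mi_counting(x, y):.3f}  kde={mi_kde(x, y):.3f}")
```

Both estimators should report a clearly positive value for the class-dependent feature above and a value near zero for a feature drawn independently of the labels; the KDE variant needs no discretisation step, which is the property the paper exploits for continuous data.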


Fig. 1





Acknowledgements

We thank A/Prof Ivy Liu from the School of Mathematics and Statistics, Victoria University of Wellington, who provided insight into the ANOVA test and helped us analyse the experimental results.

Author information



Corresponding author

Correspondence to Hoai Bach Nguyen.


About this article


Cite this article

Nguyen, H.B., Xue, B. & Andreae, P. Mutual information for feature selection: estimation or counting? Evol. Intel. 9, 95–110 (2016).



Keywords

  • Mutual information
  • Feature selection
  • Classification
  • Particle swarm optimisation