Improving performance for classification with incomplete data using wrapper-based feature selection

Special Issue · Evolutionary Intelligence

Abstract

Missing values are an unavoidable problem in many real-world datasets. Treating them inadequately can lead to large classification errors, so handling missing values well is essential for classification. Feature selection is well known to improve classification, yet it has seldom been used to improve classification on incomplete datasets. Moreover, some classifiers, such as C4.5, can classify incomplete datasets directly, but they often produce more complex models with larger classification errors. This paper proposes a wrapper-based feature selection method to improve such classifiers: feature subsets are evaluated directly by a classifier that can handle incomplete data. Empirical results on 14 datasets, using particle swarm optimisation to search for feature subsets and C4.5 to evaluate them, show that the wrapper-based feature selection not only improves the classification accuracy of the classifier but also reduces the size of the trees it generates.
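The method summarised above is a wrapper: a search algorithm proposes candidate feature subsets, and each subset is scored by the classification performance of a learner that handles missing values directly, so no imputation step is required. The sketch below is a minimal illustration of this idea under stated assumptions, not the paper's implementation: it uses a simple binary PSO with a sigmoid transfer function, and scikit-learn's HistGradientBoostingClassifier (which accepts NaN inputs natively) stands in for the C4.5 classifier used in the paper. The function names, parameter values, and 5-fold cross-validation setup are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def evaluate_subset(mask, X, y):
    """Fitness of a candidate subset: mean 5-fold CV accuracy of a
    NaN-tolerant classifier trained on the selected columns only."""
    if not mask.any():                        # an empty subset gets the worst score
        return 0.0
    clf = HistGradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()


def pso_feature_selection(X, y, n_particles=20, n_iters=30, seed=0):
    """Binary PSO wrapper: positions are real-valued, a sigmoid maps them
    to inclusion probabilities for each feature, and fitness is CV accuracy."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pos = rng.uniform(-1.0, 1.0, (n_particles, n_features))
    vel = np.zeros_like(pos)
    w, c1, c2 = 0.7298, 1.4962, 1.4962        # common constriction-style settings

    def to_mask(p):
        # Sample a bit mask from the sigmoid of each particle's position.
        return rng.random(p.shape) < 1.0 / (1.0 + np.exp(-p))

    masks = to_mask(pos)
    fit = np.array([evaluate_subset(m, X, y) for m in masks])
    pbest_pos, pbest_fit = pos.copy(), fit.copy()
    g = fit.argmax()
    gbest_pos, gbest_fit, gbest_mask = pos[g].copy(), fit[g], masks[g].copy()

    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = pos + vel
        masks = to_mask(pos)
        fit = np.array([evaluate_subset(m, X, y) for m in masks])
        better = fit > pbest_fit
        pbest_pos[better], pbest_fit[better] = pos[better], fit[better]
        g = fit.argmax()
        if fit[g] > gbest_fit:
            gbest_pos, gbest_fit, gbest_mask = pos[g].copy(), fit[g], masks[g].copy()

    return gbest_mask, gbest_fit
```

A call such as `selected, acc = pso_feature_selection(X, y)`, where X is a NumPy array with NaN marking the missing entries, returns a boolean mask over the columns together with the best cross-validated accuracy found during the search.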

Author information

Correspondence to Cao Truong Tran.

About this article

Cite this article

Tran, C.T., Zhang, M., Andreae, P. et al. Improving performance for classification with incomplete data using wrapper-based feature selection. Evol. Intel. 9, 81–94 (2016). https://doi.org/10.1007/s12065-016-0141-6
