
A new feature selection approach based on ensemble methods in semi-supervised classification

  • Theoretical Advances
  • Pattern Analysis and Applications

Abstract

In computer-aided medical systems, many practical classification applications are confronted with a massive growth in the collection and storage of data; this is especially the case in areas such as the prediction of medical test efficiency, the classification of tumors and the detection of cancers. Data with known class labels (labeled data) can be scarce, whereas unlabeled data (with unknown class labels) are more readily available. Semi-supervised learning deals with methods that exploit the unlabeled data, in addition to the labeled data, to improve performance on the classification task. In this paper, we consider the problem of using a large amount of unlabeled data to improve the efficiency of feature selection in high-dimensional datasets when only a small set of labeled examples is available. We propose a new semi-supervised feature evaluation method, called Optimized co-Forest for Feature Selection (OFFS), that combines ideas from co-forest with the embedded selection principle of Random Forest based on permutation of the out-of-bag set. We provide empirical results on several medical and biological benchmark datasets, indicating an overall significant improvement of OFFS over four other semi-supervised feature selection approaches of the filter, wrapper and embedded kinds. Our method demonstrates its ability to select features and measure their importance, improving the performance of the hypothesis learned from a small amount of labeled samples by exploiting unlabeled ones.
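The embedded selection principle referred to in the abstract, permutation importance measured on the out-of-bag set of each tree in a Random Forest, can be illustrated with a short sketch. The Python code below is not the authors' OFFS implementation (which additionally incorporates the co-forest semi-supervised scheme); it is a minimal sketch of the out-of-bag permutation measure alone, using a hypothetical synthetic dataset and hypothetical function names.

```python
# Illustrative sketch only: permutation-based feature importance measured on
# out-of-bag samples, the embedded Random Forest principle OFFS builds on.
# Dataset, function name and parameters are hypothetical, not from the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def oob_permutation_importance(X, y, n_trees=50):
    n, d = X.shape
    importance = np.zeros(d)
    for _ in range(n_trees):
        # Bootstrap sample: n draws with replacement; the unused rest is the out-of-bag set.
        boot = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), boot)
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
        tree.fit(X[boot], y[boot])
        base_acc = tree.score(X[oob], y[oob])
        for j in range(d):
            X_perm = X[oob].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and the labels
            # Importance of feature j = drop in out-of-bag accuracy after permutation.
            importance[j] += base_acc - tree.score(X_perm, y[oob])
    return importance / n_trees

# Toy usage with synthetic data: only the first feature is informative,
# so it should receive the largest importance score.
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
print(oob_permutation_importance(X, y).round(3))
```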


Notes

  1. A bootstrap sample L is obtained, for example, by randomly drawing n observations with replacement from the training sample \(L_n\), each observation having a probability 1/n of being drawn.
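A minimal sketch of this draw, on hypothetical index data:

```python
# Sketch of the bootstrap draw described above: n observations drawn with
# replacement from the training sample, each with probability 1/n per draw.
import numpy as np

rng = np.random.default_rng(42)
L_n = np.arange(10)                        # hypothetical training sample (indices)
boot_idx = rng.integers(0, len(L_n), size=len(L_n))
L_boot = L_n[boot_idx]                     # the bootstrap sample L
oob = np.setdiff1d(L_n, L_boot)            # observations never drawn: the out-of-bag set
print(L_boot, oob)
```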



Author information

Corresponding author

Correspondence to Nesma Settouti.

About this article

Cite this article

Settouti, N., Chikh, M. & Barra, V. A new feature selection approach based on ensemble methods in semi-supervised classification. Pattern Anal Applic 20, 673–686 (2017). https://doi.org/10.1007/s10044-015-0524-9
