Abstract
We propose a probabilistic framework for analyzing the inaccuracies that arise in feature selection (FS) when imperfect estimates of feature-subset performance are used. The approach is based on an analysis of the random search FS procedure and on the postulate that the joint distribution of the true and estimated classification errors is known a priori. We derive expected values of the FS bias, the difference between the actual classification error after FS and the classification error that would be obtained if ideal FS were performed using exact estimates. The increase in true classification error due to inaccurate FS is comparable to, or even exceeds, the training bias, the difference between the generalization and Bayes errors. We show that an overfitting phenomenon exists in feature selection, which we call feature over-selection in this paper. The effects of feature over-selection could be reduced if FS were performed on the basis of positional statistics. The theoretical results are supported by experiments on simulated Gaussian data as well as on high-dimensional microarray gene expression data.
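The FS bias described above can be illustrated with a small Monte Carlo sketch of random search FS. This is not the paper's derivation. It rests on assumptions chosen only for illustration: true errors of candidate subsets are drawn from a clipped Gaussian with mean 0.2 and standard deviation 0.03, and each estimated error is the true error plus independent zero-mean Gaussian noise. The function name `simulate_fs_bias` and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_fs_bias(n_subsets, noise_sd, n_trials=2000, seed=0):
    """Average FS bias over repeated random-search trials.

    Assumed model (illustrative only, not taken from the paper):
    true error ~ N(0.2, 0.03^2) clipped to [0, 1];
    estimated error = true error + N(0, noise_sd^2).
    """
    rng = random.Random(seed)
    biases = []
    for _ in range(n_trials):
        # True errors of the randomly generated candidate feature subsets.
        true_err = [min(max(rng.gauss(0.2, 0.03), 0.0), 1.0)
                    for _ in range(n_subsets)]
        # Imperfect estimates of those errors.
        est_err = [t + rng.gauss(0.0, noise_sd) for t in true_err]
        # Actual FS: pick the subset with the smallest estimated error.
        chosen = min(range(n_subsets), key=est_err.__getitem__)
        # Ideal FS would pick the subset with the smallest true error.
        biases.append(true_err[chosen] - min(true_err))
    return statistics.mean(biases)


if __name__ == "__main__":
    for sd in (0.0, 0.02, 0.05):
        print(f"noise_sd={sd}: mean FS bias = {simulate_fs_bias(100, sd):.4f}")
```

Under this assumed model the bias is zero when the estimates are exact and non-negative otherwise, since the selected subset can never have a lower true error than the ideally selected one.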
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Raudys, S. (2006). Feature Over-Selection. In: Yeung, D.-Y., Kwok, J.T., Fred, A., Roli, F., de Ridder, D. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR/SPR 2006. Lecture Notes in Computer Science, vol 4109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11815921_68
DOI: https://doi.org/10.1007/11815921_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37236-3
Online ISBN: 978-3-540-37241-7
eBook Packages: Computer Science, Computer Science (R0)