We propose a probabilistic framework for analyzing the inaccuracies that arise in feature selection (FS) when flawed estimates of feature-subset performance are used. The approach is based on an analysis of a random search FS procedure and on the postulate that the joint distribution of true and estimated classification errors is known a priori. We derive expected values for the FS bias, the difference between the actual classification error after FS and the classification error that would be obtained if ideal FS were performed using exact estimates. The increase in true classification error due to inaccurate FS is comparable to, or even exceeds, the training bias, the difference between the generalization and Bayes errors. We show that an overfitting phenomenon exists in feature selection, which we call feature over-selection. The effects of feature over-selection could be reduced if FS were performed on the basis of positional statistics. The theoretical results are supported by experiments on simulated Gaussian data as well as on high-dimensional microarray gene expression data.
Keywords: Feature Selection; Classification Error; Feature Subset; Positional Statistic; Generalization Error
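The FS bias described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's derivation: it assumes, purely for illustration, that the true errors of randomly drawn feature subsets are Gaussian and that each estimate equals the true error plus independent Gaussian noise. The function name `simulate_fs_bias` and all parameter values are hypothetical.

```python
import numpy as np


def simulate_fs_bias(n_subsets=500, n_trials=1000, mean_err=0.2,
                     sd_true=0.03, sd_noise=0.03, seed=0):
    """Monte Carlo sketch of random search FS with noisy error estimates.

    Assumed model (not taken from the paper): true errors of candidate
    subsets are N(mean_err, sd_true^2); each estimate is the true error
    plus independent N(0, sd_noise^2) noise. With these defaults the
    simulated errors stay well inside [0, 1] for practical purposes.
    """
    rng = np.random.default_rng(seed)

    # One row per trial, one column per randomly drawn feature subset.
    true_err = rng.normal(mean_err, sd_true, size=(n_trials, n_subsets))
    est_err = true_err + rng.normal(0.0, sd_noise, size=(n_trials, n_subsets))

    rows = np.arange(n_trials)
    picked = est_err.argmin(axis=1)        # subset chosen by the flawed estimate
    true_selected = true_err[rows, picked]  # its actual (true) error
    true_ideal = true_err.min(axis=1)       # error under ideal FS (exact estimates)
    apparent = est_err[rows, picked]        # estimated error of the chosen subset

    return {
        # Mean gap between actual error after FS and error under ideal FS.
        "fs_bias": float(np.mean(true_selected - true_ideal)),
        # Mean amount by which the winning estimate understates the true error.
        "optimism": float(np.mean(true_selected - apparent)),
    }


if __name__ == "__main__":
    for noise in (0.0, 0.01, 0.03, 0.06):
        print(noise, simulate_fs_bias(sd_noise=noise))
```

Under these assumptions the simulated FS bias is zero when the estimates are exact and positive once estimation noise is present, while the estimated error of the selected subset is optimistic relative to its true error, which is one way to picture the over-selection effect.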