On Feature Selection, Bias-Variance, and Bagging

  • M. Arthur Munson
  • Rich Caruana
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5782)


We examine the mechanism by which feature selection improves the accuracy of supervised learning. An empirical bias/variance analysis as feature selection progresses indicates that the most accurate feature set corresponds to the best bias-variance trade-off point for the learning algorithm. Often, this is not the point separating relevant from irrelevant features, but where increasing variance outweighs the gains from adding more (weakly) relevant features. In other words, feature selection can be viewed as a variance reduction method that trades off the benefits of decreased variance (from the reduction in dimensionality) with the harm of increased bias (from eliminating some of the relevant features). If a variance reduction method like bagging is used, more (weakly) relevant features can be exploited and the most accurate feature set is usually larger. In many cases, the best performance is obtained by using all available features.
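The trade-off the abstract describes can be illustrated with a small simulation. The sketch below is not the paper's experimental setup: the data generator, the choice of 1-nearest-neighbour as a high-variance learner, and the assumption that features are already ranked by relevance are all illustrative. It estimates test error with and without bagging as progressively more (and weaker) features are included.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, n_relevant=5, n_irrelevant=15):
    """Synthetic binary task: first n_relevant columns carry signal,
    the remaining columns are pure noise (irrelevant features)."""
    X = rng.normal(size=(n, n_relevant + n_irrelevant))
    signal = X[:, :n_relevant].sum(axis=1) + rng.normal(scale=0.5, size=n)
    return X, (signal > 0).astype(int)

def knn_predict(Xtr, ytr, Xte):
    """1-nearest-neighbour prediction (a deliberately high-variance learner)."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    return ytr[d2.argmin(axis=1)]

def bagged_predict(Xtr, ytr, Xte, n_bags=25):
    """Majority vote over 1-NN models fit on bootstrap resamples."""
    votes = np.zeros(len(Xte))
    for _ in range(n_bags):
        idx = rng.integers(0, len(Xtr), size=len(Xtr))
        votes += knn_predict(Xtr[idx], ytr[idx], Xte)
    return (votes / n_bags > 0.5).astype(int)

Xtr, ytr = make_data(200)
Xte, yte = make_data(1000)

# Grow the feature set from strongly relevant toward irrelevant features
# and compare the single learner against its bagged version.
for d in (2, 5, 10, 20):
    single = (knn_predict(Xtr[:, :d], ytr, Xte[:, :d]) != yte).mean()
    bagged = (bagged_predict(Xtr[:, :d], ytr, Xte[:, :d]) != yte).mean()
    print(f"d={d:2d}  single={single:.3f}  bagged={bagged:.3f}")
```

Under the abstract's argument, one would expect the single learner's error to turn upward once added features buy less signal than the variance they cost, while the bagged learner, with its variance suppressed, tolerates a larger feature set.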


Keywords: Feature Selection · Mean Square Error · Feature Subset · Irrelevant Feature · Corruption Level



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • M. Arthur Munson (1)
  • Rich Caruana (2)
  1. Cornell University, Ithaca, NY, USA
  2. Microsoft Corporation, USA
