Variable Selection for Classification and Regression in Large p, Small n Problems

Loh, Wei-Yin

doi:10.1007/978-1-4614-1966-2_10

Conference paper
First Online: 01 January 2011

Part of the book series: Lecture Notes in Statistics (LNSP, volume 205)

Abstract

Classification and regression problems in which the number of predictor variables is larger than the number of observations are increasingly common with rapid technological advances in data collection. Because some of these variables may have little or no influence on the response, methods that can identify the unimportant variables are needed. Two methods that have been proposed for this purpose are EARTH and Random forest (RF). This article presents an alternative method, derived from the GUIDE classification and regression tree algorithm, that employs recursive partitioning to determine the degree of importance of the variables. Simulation experiments show that the new method improves the prediction accuracy of several nonparametric regression models more than Random forest and EARTH. The results indicate that it is not essential to correctly identify all the important variables in every situation. Conditions for which this occurs are obtained for the linear model. The article concludes with an application of the new method to identify rare molecules in a large genomic data set.

References

Breiman L (2001) Random forests. Mach Learn 45:5–32
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
Chernoff H, Lo S-H, Zheng T (2009) Discovering influential variables: a method of partitions. Ann Appl Stat 3:1335–1369
Doksum K, Tang S, Tsui K-W (2008) Nonparametric variable selection: the EARTH algorithm. J Am Stat Assoc 103:1609–1620
Friedman J (1991) Multivariate adaptive regression splines (with discussion). Ann Stat 19:1–141
Loh W-Y (2002) Regression trees with unbiased variable selection and interaction detection. Stat Sin 12:361–386
Loh W-Y (2009) Improving the precision of classification trees. Ann Appl Stat 3:1710–1737
Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biometrics Bull 2:110–114
Seber GAF, Lee AJ (2003) Linear regression analysis. 2nd edn. Wiley, New York
Strobl C, Boulesteix A, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinf 8:25
Tuv E, Borisov A, Torkkola K (2006) Feature selection using ensemble based ranking against artificial contrasts. In: IJCNN ’06. International joint conference on neural networks, Vancouver, Canada

Acknowledgements

This research was partially supported by the U.S. Army Research Office under grants W911NF-05-1-0047 and W911NF-09-1-0205. The author is grateful to K. Doksum, S. Tang, and K. Tsui for helpful discussions and to S. Tang for the computer code for EARTH.

Author information

Authors and Affiliations

University of Wisconsin, Madison, WI, 53706, USA
Wei-Yin Loh

Corresponding author

Correspondence to Wei-Yin Loh.

Editor information

Editors and Affiliations

Institut für Mathematik, Universität Zürich, Winterthurerstr. 190, Zürich, 8057, Switzerland
Andrew Barbour
Department of Statistics and Applied Pro, National University of Singapore, Singapore, 119260, Singapore
Hock Peng Chan
Dept. Statistics, Stanford University, Serra Mall 390, Stanford, 94305, California, USA
David Siegmund

About this paper

Cite this paper

Loh, WY. (2012). Variable Selection for Classification and Regression in Large p, Small n Problems. In: Barbour, A., Chan, H., Siegmund, D. (eds) Probability Approximations and Beyond. Lecture Notes in Statistics, vol 205. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1966-2_10

DOI: https://doi.org/10.1007/978-1-4614-1966-2_10
Published: 07 December 2011
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1965-5
Online ISBN: 978-1-4614-1966-2
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
