Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology

Fox, Eric W.; Hill, Ryan A.; Leibowitz, Scott G.; Olsen, Anthony R.; Thornbrugh, Darren J.; Weber, Marc H.

doi:10.1007/s10661-017-6025-0

Published: 06 June 2017

Volume 189, article number 316 (2017)

Environmental Monitoring and Assessment

Abstract

Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables is used, or stepwise procedures are employed that iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF’s internal out-of-bag estimate and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues, such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data.
A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
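
The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes pure-noise predictors, swaps in a nearest-centroid classifier for RF, and swaps in a one-pass ranking by class-mean difference for backward elimination. All function names and settings are hypothetical. It only contrasts cross-validation in which variables are selected once from all sites with cross-validation in which selection is repeated inside each training fold, so that the validation folds stay external to the selection step.

```python
import random
import statistics


def make_noise_data(n_sites, n_predictors, rng):
    """Balanced good/poor labels with predictors that carry no signal (assumption)."""
    y = [i % 2 for i in range(n_sites)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_predictors)] for _ in range(n_sites)]
    return X, y


def mean_gap(X, y, rows, j):
    """Difference between the class means of predictor j over the given rows."""
    ones = statistics.fmean(X[i][j] for i in rows if y[i] == 1)
    zeros = statistics.fmean(X[i][j] for i in rows if y[i] == 0)
    return ones - zeros


def select_top(X, y, rows, k):
    """Stand-in for variable selection: keep the k predictors with the largest gap."""
    ranked = sorted(range(len(X[0])),
                    key=lambda j: abs(mean_gap(X, y, rows, j)), reverse=True)
    return ranked[:k]


def fit_centroids(X, y, train, cols):
    """Stand-in for RF: one centroid per class over the selected predictors."""
    return {label: [statistics.fmean(X[i][j] for i in train if y[i] == label)
                    for j in cols]
            for label in (0, 1)}


def predict(X, centroids, cols, i):
    """Assign site i to the class with the nearest centroid."""
    def sq_dist(label):
        return sum((X[i][j] - c) ** 2 for j, c in zip(cols, centroids[label]))
    return min((0, 1), key=sq_dist)


def cv_accuracy(X, y, k, n_folds, rng, select_inside):
    """Cross-validated accuracy with selection inside or outside the folds."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    cols_all = select_top(X, y, idx, k)  # sees every site, validation folds included
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        cols = select_top(X, y, train, k) if select_inside else cols_all
        centroids = fit_centroids(X, y, train, cols)
        correct += sum(predict(X, centroids, cols, i) == y[i] for i in test)
    return correct / len(y)


X, y = make_noise_data(n_sites=50, n_predictors=1000, rng=random.Random(0))
biased = cv_accuracy(X, y, k=10, n_folds=5, rng=random.Random(1), select_inside=False)
honest = cv_accuracy(X, y, k=10, n_folds=5, rng=random.Random(1), select_inside=True)
print(f"selection outside folds: {biased:.2f}")
print(f"selection inside folds:  {honest:.2f}")
```

Because the predictors are noise, an unbiased accuracy estimate should sit near chance. The estimate with selection outside the folds is expected to be noticeably higher, which is the upward bias that external validation folds are meant to avoid.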

Acknowledgments

We thank Brian Gray (USGS, Upper Midwest Environmental Science Center) and Kathi Irvine (USGS, Northern Rockies Science Center) for providing valuable comments that improved this paper. We also thank Rick Debbout (CSRA Inc.) for assistance in developing many of the geospatial indicators used in this study. The information in this document was funded by the U.S. Environmental Protection Agency, in part through an appointment to the Internship/Research Participation Program at the Office of Research and Development, U.S. Environmental Protection Agency, administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and EPA. The manuscript has been subjected to review by the Western Ecology Division of ORD’s National Health and Environmental Effects Research Laboratory and approved for publication. Approval does not signify that the contents reflect the views of the Agency, nor does mention of trade names or commercial products constitute endorsement or recommendation for use. The data from the 2008-2009 NRSA used in this paper resulted from the collective efforts of dedicated field crews, laboratory staff, data management and quality control staff, analysts, and many others from EPA, states, tribes, federal agencies, universities, and other organizations. For questions about these data, please contact nars-hq@epa.gov.

Author information

Darren J. Thornbrugh
Present address: National Park Service, Northern Great Plains Network, 231 East St. Joseph St., Rapid City, SD, 55701, USA

Authors and Affiliations

National Health and Environmental Effects Research Laboratory, Western Ecology Division, U.S. Environmental Protection Agency, 200 SW 35th St., Corvallis, OR, 97333, USA
Eric W. Fox, Scott G. Leibowitz, Anthony R. Olsen & Marc H. Weber
c/o National Health and Environmental Effects Research Laboratory, Western Ecology Division, U.S. Environmental Protection Agency, Oak Ridge Institute for Science and Education (ORISE) Post-doctoral Participant, 200 SW 35th St., Corvallis, OR, 97333, USA
Ryan A. Hill & Darren J. Thornbrugh

Corresponding author

Correspondence to Eric W. Fox.

About this article

Cite this article

Fox, E.W., Hill, R.A., Leibowitz, S.G. et al. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology. Environ Monit Assess 189, 316 (2017). https://doi.org/10.1007/s10661-017-6025-0

Received: 21 December 2016
Accepted: 25 May 2017
Published: 06 June 2017
DOI: https://doi.org/10.1007/s10661-017-6025-0
