Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology

Environmental Monitoring and Assessment

Abstract

Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables is used or stepwise procedures are employed that iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF’s internal out-of-bag estimate and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
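
The abstract's point about keeping validation folds external to the variable selection process can be illustrated with a small sketch. This is not the authors' code: to stay dependency-free it swaps in a nearest-centroid classifier for the random forest and a difference-in-class-means score for RF importance measures, and every function name here (`k_folds`, `top_features`, `centroid_accuracy`, `cv_accuracy`) is hypothetical. It only shows the structural difference between selecting variables once on all rows and reselecting them inside each training fold.

```python
import random
from statistics import mean


def k_folds(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def top_features(X, y, rows, m):
    """Toy importance (assumption, not RF importance): rank features by the
    absolute difference in class means, computed only on `rows`."""
    def score(j):
        ones = [X[i][j] for i in rows if y[i] == 1]
        zeros = [X[i][j] for i in rows if y[i] == 0]
        return abs(mean(ones) - mean(zeros))
    return sorted(range(len(X[0])), key=score, reverse=True)[:m]


def centroid_accuracy(X, y, train, test, feats):
    """Stand-in classifier (not a random forest): assign each test row to the
    nearer class centroid, using only the selected features."""
    cents = {c: [mean([X[i][j] for i in train if y[i] == c]) for j in feats]
             for c in (0, 1)}

    def predict(i):
        dist = {c: sum((X[i][j] - mu) ** 2 for j, mu in zip(feats, cents[c]))
                for c in (0, 1)}
        return min(dist, key=dist.get)

    return sum(predict(i) == y[i] for i in test) / len(test)


def cv_accuracy(X, y, m, k=5, external=True, seed=0):
    """k-fold accuracy. external=True reselects features inside each training
    fold; external=False selects once on all rows, so the held-out rows have
    already influenced the selection."""
    n = len(X)
    global_feats = None if external else top_features(X, y, range(n), m)
    accs = []
    for test in k_folds(n, k, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        feats = top_features(X, y, train, m) if external else global_feats
        accs.append(centroid_accuracy(X, y, train, test, feats))
    return mean(accs)


if __name__ == "__main__":
    # Pure-noise predictors: the labels carry no relationship to X.
    rng = random.Random(1)
    n, p = 40, 500
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    print("selection outside CV:", cv_accuracy(X, y, m=10, external=False))
    print("selection inside CV: ", cv_accuracy(X, y, m=10, external=True))
```

Because the simulated labels are unrelated to the predictors, an honest accuracy estimate should sit near chance. Selecting variables before splitting tends to report something more optimistic, which is the kind of selection bias the abstract warns about.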

Acknowledgments

We thank Brian Gray (USGS, Upper Midwest Environmental Science Center) and Kathi Irvine (USGS, Northern Rockies Science Center) for providing valuable comments that improved this paper. We also thank Rick Debbout (CSRA Inc.) for assistance in developing many of the geospatial indicators used in this study. The information in this document was funded by the U.S. Environmental Protection Agency, in part through an appointment to the Internship/Research Participation Program at the Office of Research and Development, U.S. Environmental Protection Agency, administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and EPA. The manuscript has been subjected to review by the Western Ecology Division of ORD’s National Health and Environmental Effects Research Laboratory and approved for publication. Approval does not signify that the contents reflect the views of the Agency, nor does mention of trade names or commercial products constitute endorsement or recommendation for use. The data from the 2008-2009 NRSA used in this paper resulted from the collective efforts of dedicated field crews, laboratory staff, data management and quality control staff, analysts, and many others from EPA, states, tribes, federal agencies, universities, and other organizations. For questions about these data, please contact nars-hq@epa.gov.

Author information

Corresponding author

Correspondence to Eric W. Fox.

Electronic supplementary material

Four supplementary PDF files accompany the online article (807 KB, 90.3 KB, 429 KB, and 1.82 MB).

About this article

Cite this article

Fox, E.W., Hill, R.A., Leibowitz, S.G. et al. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology. Environ Monit Assess 189, 316 (2017). https://doi.org/10.1007/s10661-017-6025-0

  • DOI: https://doi.org/10.1007/s10661-017-6025-0
