Abstract
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables is used or a stepwise procedure is employed that iteratively removes variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF’s internal out-of-bag estimate and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data.
A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
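The contrast between the out-of-bag estimate and cross-validation with folds external to variable selection can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, synthetic data from `make_classification`, impurity-based importances as a stand-in for whichever importance measure the paper used, and hypothetical helper names (`fit_rf`, `backward_eliminate`, `external_cv_accuracy`) and settings (`drop_frac`, `min_vars`, 50 trees).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def fit_rf(X, y, seed=0):
    """Fit a forest with out-of-bag (OOB) scoring enabled."""
    rf = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=seed)
    return rf.fit(X, y)


def backward_eliminate(X, y, drop_frac=0.2, min_vars=2, seed=0):
    """Iteratively drop the least important variables; return the column
    subset with the highest OOB accuracy and that OOB accuracy."""
    keep = np.arange(X.shape[1])
    best_keep, best_oob = keep, -1.0
    while True:
        rf = fit_rf(X[:, keep], y, seed)
        if rf.oob_score_ >= best_oob:  # ties favor the smaller subset
            best_keep, best_oob = keep, rf.oob_score_
        if len(keep) <= min_vars:
            break
        n_drop = max(1, int(drop_frac * len(keep)))
        n_drop = min(n_drop, len(keep) - min_vars)
        order = np.argsort(rf.feature_importances_)  # ascending importance
        keep = np.sort(keep[order[n_drop:]])
    return best_keep, best_oob


def external_cv_accuracy(X, y, n_folds=5, seed=0):
    """Cross-validated accuracy with variable selection repeated inside each
    training fold, so held-out sites never influence the selection."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    correct = 0
    for train, test in cv.split(X, y):
        keep, _ = backward_eliminate(X[train], y[train], seed=seed)
        rf = fit_rf(X[train][:, keep], y[train], seed)
        correct += int((rf.predict(X[test][:, keep]) == y[test]).sum())
    return correct / len(y)


# Synthetic stand-in for a good/poor response with many weak predictors.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

oob_full = fit_rf(X, y).oob_score_             # full variable set
keep, oob_selected = backward_eliminate(X, y)  # reduced variable set
cv_selected = external_cv_accuracy(X, y)       # selection inside each fold

print(f"variables kept: {len(keep)} of {X.shape[1]}")
print(f"OOB accuracy, full set:       {oob_full:.3f}")
print(f"OOB accuracy, selected set:   {oob_selected:.3f}")
print(f"external CV accuracy:         {cv_selected:.3f}")
```

Because the selected subset is the one that maximizes OOB accuracy over a sequence of candidate models fitted to the same data, its OOB accuracy can never fall below that of the full model in this sketch and tends to be optimistic. The external cross-validation figure is the fairer estimate, since the held-out fold plays no part in choosing the variables.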
Acknowledgments
We thank Brian Gray (USGS, Upper Midwest Environmental Science Center) and Kathi Irvine (USGS, Northern Rockies Science Center) for providing valuable comments that improved this paper. We also thank Rick Debbout (CSRA Inc.) for assistance in developing many of the geospatial indicators used in this study. The information in this document was funded by the U.S. Environmental Protection Agency, in part through an appointment to the Internship/Research Participation Program at the Office of Research and Development, U.S. Environmental Protection Agency, administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and EPA. The manuscript has been subjected to review by the Western Ecology Division of ORD’s National Health and Environmental Effects Research Laboratory and approved for publication. Approval does not signify that the contents reflect the views of the Agency, nor does mention of trade names or commercial products constitute endorsement or recommendation for use. The data from the 2008-2009 NRSA used in this paper resulted from the collective efforts of dedicated field crews, laboratory staff, data management and quality control staff, analysts, and many others from EPA, states, tribes, federal agencies, universities, and other organizations. For questions about these data, please contact nars-hq@epa.gov.
Cite this article
Fox, E.W., Hill, R.A., Leibowitz, S.G. et al. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology. Environ Monit Assess 189, 316 (2017). https://doi.org/10.1007/s10661-017-6025-0