1 Introduction

A large proportion of the Earth’s surface has been transformed by anthropogenic land use activities in recent centuries. Land use and land cover change (hereafter, LUCC) was once considered a local environmental issue, but is becoming globally important due to its increasingly widespread effects upon natural environments (Foley 2005; Lambin and Geist 2006). Comprehending these effects requires, in part, the understanding of relationships between variations in socioeconomic (hereafter, SE) and biogeophysical (hereafter, BGP) factors associated with the LUCC with which they co-occur (Geist and Lambin 2001, 2002). However, understanding these relationships is difficult because LUCC is a result of complex interactions among social, economic and environmental factors acting across different scales of space and time (Geist and Lambin 2001, 2002; Caldas et al. 2013). Therefore, it is necessary to design studies carefully so that inferences are reliable. Unreliable conclusions can lead to distorted management recommendations, resulting in missed conservation opportunities, and a waste of resources and time (Oliveira et al. 2017).

Several studies have investigated relationships between LUCC and a wide variety of socioeconomic and environmental factors. LUCC is commonly expressed in terms of deforestation rates (DF) and forest fragmentation (FF) metrics. Examples of the multiscale and multifactor dynamics influencing LUCC patterns are: the increasing demand for food and other commodities (Aide and Grau 2004; DeFries et al. 2004; Barbier et al. 2010; Caldas et al. 2013), shifts in regional economies changing household level conditions (Perz 2004; Richards et al. 2008; Wright and Samaniego 2008; Gaughan et al. 2009), indirect effects of tourism (Gaughan et al. 2009), globalization of markets (Hecht et al. 2006; Parés-Ramos et al. 2008), and the presence and effectiveness of social institutions (Hecht et al. 2006; Richards et al. 2008).

Studies addressing the impacts of LUCC upon tropical systems have increased significantly in recent decades (Malhi et al. 2014). Those impacts have been separated into two different types: underlying (or indirect) and proximate (or immediate) causes (Geist and Lambin 2002). Proximate causes are human actions that directly affect these changes, while underlying causes affect these changes indirectly (Geist and Lambin 2002). The main recognized proximate causes of LUCC in tropical countries are: agricultural expansion (e.g., shifting cultivation and permanent cultivation), cattle ranching and infrastructure expansion (e.g., transportation infrastructure) (Pfaff 1999; Geist and Lambin 2001; Perz et al. 2007). Furthermore, LUCC is also influenced by the underlying drivers, especially demographic dynamics (e.g. population growth) and economic factors (e.g. local or international demand for commodities) (Geist and Lambin 2001; Caldas et al. 2013). In many regions, there is a clear relationship between population change and LUCC (Geist and Lambin 2002). However, other studies have shown that LUCC can be modified by socioeconomic and environmental factors (Geist and Lambin 2002).

A few studies have attempted to investigate drivers and associated factors of land use and cover changes in the Brazilian Atlantic Forest. Silva et al. (2007) conducted a local scale study and found an indirect influence of topographic relief on forest cover. Teixeira et al. (2009) showed that proximate causes influence the dynamics of deforestation and forest re-growth. They identified that losses in young secondary vegetation and forest were far from rivers, on gentle slopes and near urban areas, while higher forest re-growth rates were near rivers, on steep slopes and far from dirt roads. Freitas et al. (2010) analysed the effects of roads, topography and land use on forest cover dynamics, and demonstrated that forest dynamics were directly related to past road density, past land use (buildings and agriculture expansion), and slope variation. Lira et al. (2012) described LUCC in three Atlantic Forest fragmented landscapes (in São Paulo state) over time and found that LUCC deviated from a random trajectory. Their results also suggested a forest transition in some Atlantic Forest regions. Freitas et al. (2013) used a combination of statistical approaches—multivariate data analysis (CCA), linear regression models (OLS), local spatial regression models (GWR) and spatial clustering procedures (SKATER)—to investigate relationships between LUCC processes and environmental and socioeconomic factors in an Atlantic Forest region with an area of \(\sim \)12,000 \(\hbox {km}^{2}\) in the state of Rio Grande do Sul. Their findings revealed a competitive and inter-related set of LUCC processes, due to the landscape complexity. More recently, Ferreira et al. (2015) investigated how forest cover and agricultural land use varied in an area of Atlantic Forest in São Paulo state, emphasizing sugarcane expansion. 
In addition, the differing deforestation rates among Brazilian states suggest a general trend of decline followed by stabilization of forest remnants in this biome (SOS Mata Atlântica/INPE 2014). However, the data sets needed to understand these landscape dynamics are provided by different organizations and show discrepancies (Farinaci and Batistella 2012).

LUCC studies have used a range of statistical techniques. Some studies have used relatively simplistic approaches, such as Mann–Whitney and Kruskal–Wallis tests (Quezada et al. 2013), or correlation analyses (Beilin et al. 2014). Others have applied more robust approaches, combining or comparing different methods, such as statistical redundancy analyses (RDA) (Parcerisas et al. 2012); ordinary least squares regression (OLS) and geographically weighted regression (GWR) (Jaimes et al. 2010; Gao and Li 2011); canonical correspondence analysis (CCA), OLS, GWR and spatial clustering procedures (Freitas et al. 2013); and stepwise multiple regression models (Gong et al. 2013). Most of these studies considered a limited number of potential independent factors that had normal distributions, as this is the basic requirement for using parametric techniques. Therefore, modelling approaches must be further evaluated in terms of the choice of independent factors and metrics, as well as the selection and interpretation of appropriate statistical methods. There is also a need for further studies that include a large number of factors encompassing, as much as possible, all aspects of the socioeconomic and biogeophysical context within which LUCC is taking place.

Despite the improvements in our understanding of the impacts of LUCC on tropical environments, there is still no optimal tool for understanding relationships between DF/FF metrics and SE or BGP factors. Random Forest analysis (RF; Breiman 2001) is a variable selection technique and has great potential in this respect. RF is capable of identifying complex interactive and non-linear response-predictor relationships, and has excellent predictive performance (Prasad et al. 2006; Smith et al. 2011). Thus, application of RF analysis to disentangle these sorts of relationships may be particularly useful. RF is widely used, for example, in bioinformatics (Cutler and Stevens 2006), land cover classification (Gislason et al. 2006) and the analysis of medical experiments. Although ecological applications were initially few, RF has recently gained popularity in this area (Prasad et al. 2006; Fu et al. 2010; Gilbert and Chakraborty 2011; Bonilla-Moheno et al. 2012; Ellis et al. 2012a; Leal et al. 2016).

In this study, we investigate the application of RF regression to the task of identifying relationships between a large set of SE and BGP candidate independent variables (factors) and metrics which quantify the current patterns of DF and FF of the Brazilian Atlantic Forest in the state of Minas Gerais, Brazil. This study considers an unusually large set of more than 300 SE and BGP factors. The outputs from RF analysis are compared with those derived from the application of stepwise multiple linear regression (hereafter, STEP), a classical statistical approach, to the same datasets. Our hypotheses are: (1) RF is better than STEP at elucidating relationships between SE and BGP factors and FF/DF metrics. Because RF can identify complex interactive and non-linear response-predictor relationships, we expect it to quantify the relationships between factors and metrics more accurately than the classical approach considered here; (2) RF and STEP identify broadly the same SE and BGP factors as being most important in explaining variation in the FF/DF metrics. Based on the LUCC literature, we expect that certain factors, such as population and road densities and topographic measurements, will be identified as most important regardless of the methodological approach used (e.g. Geist and Lambin 2002; Silva et al. 2007; Freitas et al. 2010).

2 Methods

2.1 Study area

The study area is located within the state of Minas Gerais, in South-eastern Brazil. It comprises the 518 municipalities which fall entirely within the largest contiguous area of the Atlantic Forest biome and encompasses 34% (19,904,146 ha) of Minas Gerais (IBGE 2017, Fig. 1). This study site has a wide variability across the municipalities in the magnitude of DF/FF metrics and in the SE and BGP factor values.

The study region is characterized by rolling hills which rise from 200 to 1600 m altitude. It is a very rugged area with a large proportion of highlands as well as plateaus and plains. There are several climate types linked to the relief, with a generally warmer climate in the north and a cooler one in the south. The distance from the ocean also has a climatic effect (e.g. maritime vs. inland climate) upon the study area. The region is, on average, relatively sparsely populated, with a tendency for higher concentrations of population towards the south, which also has the smallest municipality areas. The southern part of the study area is also relatively richer and more developed than the other parts of the study area and the Brazilian average. The main industries and sources of employment are cattle herding (which corresponds to 10% of the Brazilian total), coffee production and the extraction of iron ore. All of this information on the study area, and further details, can be found in the ecologic-economical zoning of Minas Gerais, ZEE-MG (Scolforo et al. 2008).

Fig. 1 Minas Gerais State, Brazil, and the 518 municipalities used in this study. The inset map on the left shows the location of Minas Gerais State within Brazil

2.2 Variable selection

This work used large datasets provided by two broader-scale projects carried out in Minas Gerais State, Brazil. The DF and FF metrics were derived from the vegetation monitoring system dataset (Scolforo and Carvalho 2006; Carvalho and Scolforo 2008; Carvalho and Scolforo, unpublished data), which comprises land cover maps from 2003 to 2011.

A DF metric, the growth rate of deforestation (GRD, percentage), was calculated for each municipality using digital change detection applied to Landsat images from the vegetation monitoring system dataset (Scolforo and Carvalho 2006; Carvalho and Scolforo 2008; Carvalho and Scolforo, unpublished data). GRD was normalized by the amount of remaining forest area within each municipality.

To quantify forest fragmentation, we used the 2011 land cover map from the vegetation monitoring system dataset (Scolforo and Carvalho 2006; Carvalho and Scolforo 2008; Carvalho and Scolforo—unpublished data). A set of 225 landscape metrics from class and landscape levels from all of the different categories available in FragStats 4.0 (McGarigal et al. 2016) were calculated for each of the 518 municipalities considering the forest cover configuration in 2011. These were then passed through a three-stage filtering process to provide a tractable set of metrics for use in our analysis of statistical approaches. Firstly, noting that metrics in datasets such as this can be highly correlated (Riitters et al. 1995), we selected a subset of uncorrelated metrics based on Pearson correlation analyses conducted using the Pairs-panel analyses in R. We discarded those metrics which were strongly correlated (defined for these purposes as having correlation coefficients for which \({p} \le 0.01\)) with selected variables, and therefore deemed to be redundant. When two or more variables were significantly correlated, the selection criteria to choose one of them were mathematical simplicity and an intuitive judgment of their explanatory power in terms of ecological meaning. Secondly, we chose metrics from the remaining subset that were commonly used in literature (those which were repeatedly found in the papers consulted) found via a search on the Web of Knowledge website (http://wok.mimas.ac.uk/). The search was carried out from 2011 to June 2013, using the key-words “landscape metrics” and/or “landscape indices”. This search yielded 48 papers, of which four were found, on inspection, to be out of scope, and we had no access to another five. The papers consulted in the review can be seen in the Supplementary material (List S1—ESM1). 
Finally, we verified the normality of the residuals from linear models (see the section Stepwise multiple linear regression for more details). Metrics with non-normally distributed residuals were discarded to enable comparison of the random forest method with classical, parametric multiple regression, which in most cases requires normally distributed variables. The result of this filtering process was the selection of three FF metrics at municipality scale: the total remaining forest (CA), a quantification of remaining forest; the mean Euclidean nearest-neighbour distance (ENN), a measure of patch isolation; and the patch density (PD), a measure of forest spatial structure (Table 1).
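The first filtering stage can be illustrated with a minimal sketch. It is not the procedure run in R for this study: the threshold `r_crit` on the absolute correlation is a hypothetical stand-in for the significance criterion used above, the `priority` list stands in for the judgment about simplicity and ecological meaning, and the metric `TA` and all values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, priority, r_crit):
    """Walk through metrics in priority order and keep a metric only if
    its |r| with every metric kept so far is below r_crit (assumed cut-off)."""
    kept = []
    for name in priority:
        if all(abs(pearson(columns[name], columns[k])) < r_crit for k in kept):
            kept.append(name)
    return kept

# Invented toy data: five municipalities, three metrics.
columns = {
    "CA":  [10.0, 20.0, 30.0, 40.0, 50.0],
    "TA":  [11.0, 19.0, 31.0, 42.0, 49.0],  # hypothetical metric, nearly a copy of CA
    "ENN": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_correlated(columns, priority=["CA", "TA", "ENN"], r_crit=0.9)
print(selected)
```

In this toy case the redundant metric `TA` is discarded because `CA`, which comes first in the priority order, already carries the same information.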

Table 1 Descriptions of deforestation (DF) and forest fragmentation (FF) metrics (dependent variables)

The SE and BGP factors were derived from the ecologic-economical zoning of Minas Gerais, ZEE-MG (Scolforo et al. 2008). Almost all available factors were derived within political administrative units at the scale of municipalities, the smallest administrative units in Brazil. To avoid bias, we chose to use only the metrics that would permit an analysis at the municipality scale.

SE and BGP factors were obtained from the ZEE-MG database, which collates data from different national agencies. The years for which these variables were collected were limited by the availability of information from national agencies, and ranged from 2003 to 2006. Based on data availability, SE factors from four categories (production, exploration, human and institutional) were used. Variables from a further four categories of BGP factors (topography, distance, accessibility, and geographical location) were also selected. This gave an initial list of more than 300 candidate independent factors. Descriptions of how these variables were calculated can be found in Scolforo et al. (2008). From this list, a tractable sub-set of factors was derived using the first step of the filtering process described above for the FF metrics. As a result, a total of 34 SE and BGP variables were selected as factors for use in our comparative analysis of statistical approaches (see Table S2, in the supplementary material ESM2, for a complete description of all factors).

2.3 Random forest analysis (RF)

RF is a machine-learning technique that may be used for predictive modelling of multiple outputs from large input datasets. In short, RF uses an ensemble of decision trees with binary divisions, each capable of producing an outcome when presented with a set of input values (Cutler et al. 2007). For regression modelling problems the tree response is an estimate of dependent (outcome) variable values derived from the given values of a set of independent (input) variables. RF uses a regression tree approach (also known as “CART”; Breiman et al. 1984), to build a number of decision tree models from randomly selected subsets of training samples and factors (Cutler et al. 2007). Model fitness is examined using validation data that is not in the training sub-sample; hence, cross-validation with external data is not necessary. The validation sample is also used to calculate measures of variable relative importance (Ellis et al. 2012b). The outputs from all of the trees are then averaged, which provides predictive accuracy and low bias (Breiman 2001).
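The ensemble idea described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation used in this study: the data are simulated, the forest is tiny, and all function names (`grow`, `fit_forest`, and so on) are hypothetical. Each tree is grown on a bootstrap sample with a random subset of factors tried at each binary split, and the rows left out of each bootstrap sample (the out-of-bag rows) are used to estimate the proportion of variance explained.

```python
import random

def sum_sq(ys):
    """Sum of squared deviations from the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, features):
    """Binary split (factor, threshold) with the lowest squared error, or None."""
    best, best_sse = None, float("inf")
    for f in features:
        values = sorted({x[f] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            sse = sum_sq(left) + sum_sq(right)
            if sse < best_sse:
                best, best_sse = (f, t), sse
    return best

def grow(rows, n_try, rng, min_size=5):
    """Grow a regression tree; n_try randomly chosen factors are tried per split."""
    mean = sum(y for _, y in rows) / len(rows)
    if len(rows) <= min_size:
        return mean                                   # leaf: mean response
    split = best_split(rows, rng.sample(range(len(rows[0][0])), n_try))
    if split is None:
        return mean
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return (f, t, grow(left, n_try, rng, min_size), grow(right, n_try, rng, min_size))

def predict(tree, x):
    """Follow the binary splits down to a leaf value."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def fit_forest(rows, n_trees, n_try, seed=0):
    """Return the trees and the out-of-bag proportion of variance explained."""
    rng = random.Random(seed)
    n = len(rows)
    oob_sum, oob_cnt, trees = [0.0] * n, [0] * n, []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrap sample
        tree = grow([rows[i] for i in idx], n_try, rng)
        trees.append(tree)
        for i in set(range(n)) - set(idx):            # rows not in the sample
            oob_sum[i] += predict(tree, rows[i][0])
            oob_cnt[i] += 1
    used = [i for i in range(n) if oob_cnt[i] > 0]
    ys = [rows[i][1] for i in used]
    mse = sum((rows[i][1] - oob_sum[i] / oob_cnt[i]) ** 2 for i in used) / len(used)
    return trees, 1 - mse / (sum_sq(ys) / len(ys))

# Simulated data: the response depends on factor 0 only; factors 1 and 2 are noise.
rng = random.Random(42)
rows = []
for _ in range(120):
    x = [rng.random(), rng.random(), rng.random()]
    rows.append((x, (10.0 if x[0] > 0.5 else 0.0) + rng.gauss(0.0, 1.0)))
trees, oob_r2 = fit_forest(rows, n_trees=50, n_try=1)
print(f"OOB variance explained: {100 * oob_r2:.1f}%")
```

Because every row is predicted only by trees that never saw it, the out-of-bag estimate plays the role of the internal validation described above.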

We used the R “extendedForest” library provided by the Gradient Forest project (Smith et al. 2011; Ellis et al. 2012b) to carry out RF analysis. This package was developed for use in ecological studies of species distributions. It integrates results from RF analyses for a number of individual species distributions into results that enable prediction of multiple species distributions (Smith et al. 2011; Ellis et al. 2012b). In addition, it is able to analyse large numbers of potential factors and to reduce bias when predictors are correlated (Smith et al. 2011). In our study, we extended the application of extendedForest by using the DF and FF metrics described above (i.e. GRD, ENN, CA and PD) in place of the species distributions used in the application for which it was originally developed. We built partial dependence plots using the variable relative importance values. Models were fitted with 10,000 trees. At each split, one-third of the factors were randomly sampled as candidates. We excluded from the final models the variables with negative relative importance values, which did not contribute to the overall explanation. In order to test our first hypothesis, we also calculated the \(\hbox {R}^{2}\) in the RF approach to compare it with outcomes from the stepwise multiple linear regression.
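Two of the settings above reduce to simple arithmetic, sketched here under stated assumptions. Rounding one-third of the factors down is an assumed convention, and rescaling the retained importance values so that they sum to 100 is one plausible way of expressing them as percentages, not necessarily the one used by extendedForest. Apart from `POINT_X` (longitude), the factor names and all importance values are invented.

```python
def candidates_per_split(n_factors):
    """One third of the factors, rounded down (assumed), and at least one."""
    return max(n_factors // 3, 1)

def to_percent(importance):
    """Drop factors with negative importance and rescale the rest to sum to 100."""
    kept = {name: value for name, value in importance.items() if value >= 0}
    total = sum(kept.values())
    return {name: 100.0 * value / total for name, value in kept.items()}

n_try = candidates_per_split(34)   # 34 factors were retained after filtering

# Invented raw importance values for illustration only.
raw = {"POINT_X": 6.0, "road_density": 3.0, "mean_slope": 1.0, "noise_factor": -0.5}
percent = to_percent(raw)
print(n_try, percent)
```

With 34 candidate factors this convention gives 11 factors tried at each split, and the factor with negative importance is removed before the percentages are computed.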

2.4 Stepwise multiple linear regression (STEP)

From a wide range of possible approaches, we selected STEP as a comparator method against which to assess the performance of RF. This type of technique is arguably the most common approach to data-based prediction and simulation tasks (Whittingham et al. 2006). For situations in which the number of variables is high, as is the case here, it is appropriate to incorporate into the modelling process a method for selecting only those factors that contribute most strongly to the predictive model delivered. The STEP approach to multiple regression is a routine technique for achieving this (see, for example, Efroymson 1960; Hocking and Mar 1976; Furundzic 1998). Despite having a number of weaknesses, notably bias in parameter estimation, inconsistencies among model selection algorithms, and an inappropriate focus on a single best model (Burnham and Anderson 2002; Kadane and Lazar 2004; Whittingham et al. 2006), it is used widely within ecology and landscape studies (Whittingham et al. 2006).

The stepwise method combines forward selection and backward elimination procedures (Venables and Ripley 2002; James et al. 2013). It proceeds by first setting up an initial model incorporating a subset of the candidate independent variables. Then, this model is iteratively altered by adding significant variables and/or removing insignificant ones, in a process called the stepping procedure. A variable that enters at an early stage may become superfluous at later stages because of its relationship with other factors subsequently added to the model (Kleinbaum et al. 1998). To check this possibility, at each step a partial F test is carried out for each variable currently in the model, regardless of the stage at which it was entered. The whole process is repeated until no more variables can be added or removed, which means that the model is optimized, or until a specified maximum number of steps is reached. Many statistical methods are available to test the stability and validity of the final regression model. We used the adjusted square of the correlation coefficient (adjusted \(\hbox {R}^{2}\)) and the AIC (Akaike Information Criterion) to assess our final model. The AIC was also used to calculate relative variable importance. Implementation was based on the dredge function for automated model selection, which is available in the R “MuMIn” package (Barton 2014). It calculates AIC values for models with all possible combinations of factors and ranks the models based on the calculated values. This exhaustive search makes MuMIn highly demanding in terms of computational time and resources. We determined the relative importance of each independent variable selected in the models from the STEP approach based on AIC weights (importance function in MuMIn; Burnham and Anderson 2002). The relative importance values were converted to percentages for comparison with the equivalent outcomes from RF.
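The AIC-weight calculation can be sketched as follows, using the sum-of-weights definition of Burnham and Anderson (2002): each candidate model receives a weight proportional to exp(-ΔAIC/2), and the importance of a variable is the sum of the weights of the models that contain it. The AIC values and factor names below are invented, and the code is an illustration of the formula rather than a reproduction of the MuMIn implementation.

```python
import math

def akaike_weights(aic):
    """Akaike weight of each model: exp(-delta/2), normalised to sum to 1."""
    best = min(aic.values())
    rel = {model: math.exp(-0.5 * (value - best)) for model, value in aic.items()}
    total = sum(rel.values())
    return {model: r / total for model, r in rel.items()}

def variable_importance(aic):
    """Sum of Akaike weights over all models that include each variable."""
    importance = {}
    for model, weight in akaike_weights(aic).items():
        for variable in model:
            importance[variable] = importance.get(variable, 0.0) + weight
    return importance

# Invented AIC values; each key lists the (hypothetical) factors in a model.
aic = {
    (): 110.0,
    ("slope",): 100.0,
    ("roads",): 108.0,
    ("slope", "roads"): 102.0,
}
weights = akaike_weights(aic)
importance = variable_importance(aic)
percent = {v: 100.0 * w for v, w in importance.items()}
print(percent)
```

In this toy set the factor that appears in the two best-supported models receives an importance close to 100%, while the other factor receives a much lower value.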

2.5 Final models

To make the models easier to follow, we refer to each of them by an acronym. Each acronym combines the analysis approach (RF or STEP) with the metric modelled, which reflects either deforestation (DF): GRD; or forest fragmentation (FF): CA, ENN and PD. This resulted in four models selected using the RF approach and four using the STEP approach: the growth rate of deforestation (RF-GRD and STEP-GRD); the total remaining forest (RF-CA and STEP-CA); the mean Euclidean nearest-neighbour distance (RF-ENN and STEP-ENN); and the patch density (RF-PD and STEP-PD).

3 Results

3.1 Random forest analysis

The RF analysis provides evidence of the effect of SE and BGP factors (see Table S2 in the supplementary material ESM2) on the forest metrics, explaining a high proportion of the observed variance of some metrics (up to 99.40%) and a lower proportion for others (down to 39.38%) (Fig. 2; see also Table S3 in the supplementary material ESM3). In the latter cases, the outcomes imply that the factors have restricted explanatory power, and that some of the variability in these metrics across the municipalities is not explained by the factors considered here. The relative importance of each factor was quantified as its partial contribution to explaining the variability of each of the four forest metrics tested by both statistical approaches, expressed as a percentage. Although these values are not quantitatively comparable between the metrics, they allow us to rank the factors in terms of their relative importance in each metric model.

Fig. 2 Relative importance plot for factors from the random forest (RF) and stepwise multiple regression (STEP) analysis approaches, in percentage (%). Factors are defined in Table S2 (supplementary material ESM2). The eight selected models from both approaches are: the growth rate of deforestation (RF-GRD and STEP-GRD); the total remaining forest (RF-CA and STEP-CA); the mean Euclidean nearest-neighbour distance (RF-ENN and STEP-ENN); and the patch density (RF-PD and STEP-PD). Metrics used in models are defined in Table 1. Note that each model shows only the most important predictors. *Percentage of variance explained in each model

Of the four models using the RF approach, RF-GRD performed best, with a very high proportion (99.40%, Fig. 2) of its variance explained by the factors. Geographical location and distance variables (longitude and the minimum distance of forest patches to the nearest reservoir and the nearest protected area) were the most important factors in this respect. Among the many factors in the GRD model selected by RF, those related to topography and crop production were also relatively important. Longitude (POINT_X) explained the greatest part of the variance in the RF-GRD model (Fig. 3a).

Fig. 3 Partial contribution of socioeconomic (SE) and biogeophysical (BGP) factors to deforestation (DF) and forest fragmentation (FF) in Minas Gerais, Brazil, derived from the Random Forest (RF) analysis approach. Factors are defined in Table S2 (Supplementary material ESM2). a The growth rate of deforestation (RF-GRD); b Patch density (RF-PD); c The total remaining forest area (RF-CA); and d The Euclidean nearest-neighbour distance (RF-ENN). Metrics used in models are defined in Table 1

The selected patch density model (RF-PD) had the second highest amount of its variation explained (61.52%, Fig. 2). A large number of factors were identified as having some role in explaining RF-PD variations between municipalities; those with the highest importance were associated with the road network or were topographic. Road density was the factor which most explained the variance in this model (Fig. 3b).

The selected models for total remaining forest (RF-CA) and for the mean Euclidean nearest-neighbour distance between forest patches (RF-ENN) also had relatively high amounts of their variation explained (40.67 and 39.38%, respectively, Fig. 2). The most important factors in these models were the mean slope of each municipality (Fig. 3c) for the selected RF-CA model and the mean altitude of each municipality (Fig. 3d) for the selected RF-ENN model. Other topographic factors (the mean altitude across each municipality for RF-CA, and the mean slope across each whole municipality, and the mean slope within deforested areas, for RF-ENN) were also relatively important, as were geographical location and distance factors (distances to the nearest protected area and nearest steel mill, and longitude).

Overall, factors from the geographical location, distance, topography, institutional and accessibility categories appeared among the most important factors in all four selected models from the RF approach, namely: the latitude of municipalities; the minimum distance from forest patches to the nearest steel mill and the longitude of municipalities; mean slope, mean slope within deforested areas and mean altitude; the amount of protected area in each municipality; and the density of roads.

3.2 Comparisons of RF with STEP

Outcomes from the STEP approach are shown alongside those for RF, in as comparable a form as possible in Fig. 2. Note that, although “percentage importance” values are quoted for models from both analysis approaches, these values are not quantitatively comparable between these two methods’ outcomes or between different metrics addressed in the models. Rather, these values allow us to rank the factors in terms of their relative importance for explaining the variability of each model. The percentages of variance explained by the two analysis approaches are, however, comparable. Both approaches provided evidence of relevant relationships, but models from the RF approach surpassed the capacity of the classical approach for explaining model variance. However, the results are mixed in terms of the factors selected as being most important by each approach.

The selected STEP-CA model performed best of all models from the STEP approach. It explained a proportion of CA variation between municipalities similar to that explained by RF (39.80% for STEP-CA, cf. 40.67% for RF-CA). There was also a strong similarity between the most important factors selected by the models from both approaches, since all of the factors selected by STEP were also selected by RF, except soil types and employability. The mean slope was the most important factor in the selected models from both approaches. Other important factors were latitude, longitude and mean altitude. The amount of protected area in each municipality and the number of rural family farms were also important in STEP-CA.

STEP-ENN had the second highest value of ENN explained variance (30.91% by STEP-ENN, 39.38% by RF-ENN). Factors were less similar between ENN models than in the CA models. While the mean altitude was the most important factor found by RF-ENN, four factors were important in the STEP-ENN selected model, namely: the mean slope, soil type, density of roads and latitude.

The selected PD model from the STEP approach (STEP-PD) also had a relatively high amount of its variance explained compared to the other models from STEP, but much less than the selected RF-PD model (29.40%, cf. 61.52% for RF-PD). Some of the factors were found in the selected models from both approaches. However, only one of the most important factors appeared in both of these models: the mean slope of deforestation patches, a topographic factor. The density of roads was the factor identified as being most important by RF-PD, while a similar factor, the minimum distance to the nearest road, had the highest importance in STEP-PD. Another topographic factor important in STEP-PD was the minimum mean slope within each municipality, while in RF-PD the mean altitude and latitude were also important.

There was a strong contrast between the amounts of variance explained for the growth rate of deforestation by the STEP (17.36%) and RF (99.40%) approaches. In STEP-GRD, the minimum distances to the nearest protected area and nearest steel mill were the most important factors explaining GRD variance, followed by the mean slope and the amount of protected area. In RF-GRD, the longitude was most important and, secondarily, the latitude and the minimum distances to the nearest steel mill and nearest reservoir.

4 Discussion

4.1 Random forest analysis

In the outcomes of the RF approach, we observed some strong relationships between the SE and BGP factors and the DF/FF metrics. RF performed best for the growth rate of deforestation (RF-GRD) and second best for the patch density (RF-PD) selected models, explaining 99.40 and 61.52% of their variances, respectively, which are high values for ecological studies. It also performed relatively well for the total remaining forest (RF-CA) and mean Euclidean nearest-neighbour distance (RF-ENN) selected models, explaining 40.67 and 39.38% of their variances, respectively. In terms of model performance, this may suggest that the RF approach is good at identifying factors that describe some macro-scale forest metrics (rate of deforestation and the overall remaining forest) and the distribution of patches within a landscape (their density and mean isolation from each other). Alternatively, these results could be interpreted as indicating that the rate of deforestation, remaining forest and patch-distribution scale variables (GRD, PD, CA and ENN) are closely linked to the factors we have considered here. In other words, RF is particularly good at identifying links for the types of factors we analyse, since it explains a higher proportion of the variance in the metrics. It is important to note that, even using a very large dataset comprising many factors, much of the variance in some of the four metrics was not accounted for by our selected models. In addition, the question of whether it is primarily the nature of the model or the nature of the factors that has led to this finding is not answerable by this first application of RF to this type of data, and remains to be addressed by further investigation.

Turning now to consideration of the factors, we found that some of them were particularly strongly related to some of the metrics, for example longitude (which explained 20.7% of GRD variance), road density (which explained 20.4% of PD), and mean altitude (which explained 18.5% of ENN). However, neither the nature of these relationships nor the reason for them (i.e. whether they are causally linked or simply co-vary) is elucidated by RF. Despite these cases of strong individual-variable links, no single factor was found to be related to all of the metrics. Geist and Lambin (2002), who investigated the causes of deforestation of tropical forests, also did not find a single important factor. They concluded that forest loss is due to a combination of factors that vary with historical and geographical context. We conclude from the present study that we can expect the same for forest fragmentation.

At the level of factor categories, and considering only the three factors in each model that contributed most to explaining the variance in the metrics, we found that those from the Geographical location, Topography, Distance and Accessibility categories contributed most to explaining variance in the RF outcomes. In contrast, variables from the Exploration, Institutional, and Productivity categories made hardly any contribution. Additionally, factors from the Geographical location and Topography categories made up the majority of the most important factors explaining each dependent variable in the RF models. This suggests that the physical environment is more important than social or economic issues for determining variations in DF and FF metrics between municipalities. Other studies conducted in the Atlantic Forest agree with our results, showing that physical environment factors play a significant role in deforestation and forest fragmentation (Silva et al. 2007; Teixeira et al. 2009; Freitas et al. 2010). In other countries of Latin America, a similar pattern can also be observed, with the physical environment being more important than socioeconomic or demographic factors for explaining land-cover change (Bonilla-Moheno et al. 2012; Redo et al. 2012). In addition, specifically in our case, geographical location is important given the discrepancies between the northern and southern parts of the study area, mainly in terms of development, so that it could also work as a proxy for some socioeconomic and demographic factors. However, these findings do not exclude a contribution of socioeconomic or demographic factors to deforestation and forest fragmentation, since they might be indirectly linked to the physical environment factors. For example, deforestation is more likely to be located in lower and less steep terrain, where transport and mechanised agriculture are easier (Apan and Peterson 1998). It is also more likely to have occurred in sites more suitable for agriculture (Flamenco-Sandoval et al. 2007; Killeen et al. 2007; Fearnside 2008) in terms of soil fertility and hydrological conditions. This finding has important implications for management policies aimed at conserving the Atlantic Forest and possibly other biomes that are fragmenting under anthropogenic pressures, although it requires further evidence to be confirmed. It points to the importance of valuing biodiversity in impacted sites (lower and less steep terrain) when selecting areas for conservation, for example (Margules and Pressey 2000; Metzger and Casatti 2006). Also, although this ordering of importance of the different types of factors is quite coherent across the RF outcomes, the question remains as to whether it is “true”. Claims to this effect are supported by noting that factors that RF-type methods have identified as most important for classification have been found to coincide with ecological expectations in the literature (Cutler et al. 2007; Wei et al. 2010; Ellis et al. 2012b).

4.2 Comparisons of RF with STEP

Like RF, the STEP approach found some strong relationships between the SE and BGP factors and the DF/FF metrics. Unlike RF, the STEP selected models explained the most variance and showed the strongest relationships for the amount of forest, followed by the isolation of forest patches. There was also less difference among the performances of the STEP models: while the explained variances from RF ranged from 39.38 to 99.40%, STEP explained between 17.36 and 39.80% of the variance across the four metrics. This confirms our first hypothesis, that RF quantifies the relationships between factors and metrics more accurately than the STEP approach.

Contrary to our second hypothesis, overall there was more disagreement than agreement between the two approaches in terms of the selection and importance of independent factors. Few factors were selected as important by both approaches. Considering the categories of factors, both approaches found that factors from the Topography category were of higher importance in all selected models, while the Geographical location category was more important in the selected models from RF than in those from the STEP approach. Factors from the Distances and Accessibility categories were of intermediate importance, and factors from the Exploration, Institutional and Production categories were of little importance. In the selected models from the STEP approach, we found that the most important factors explaining each forest metric also belonged to the Distances and Topography categories.

The most important factors of the selected models in the RF approach were different from those selected in the STEP approach. Considering the selected rate of deforestation model from RF, the most important factor influencing it was the longitude of municipalities, which represents a measure of the distance from the ocean (climate proxy). We expected that deforestation increases along a socioeconomic gradient, which may reflect a higher degree of development and, consequently, higher exploration of natural resources, for example. On the other hand, the most important factor in the selected model from STEP was the minimum distance to protected areas. In a similar way, we expected that deforestation decreases when forest patches are closer to natural reserves. The smaller the distance, the closer the forest patches are to a natural reserve. This may mean that there is a greater amount of forest in the municipalities where the forest patches are closer to the natural reserves, whereas in those municipalities where the reserves are more distant, there is possibly a smaller amount of forest, and therefore the deforestation rate is also smaller. Although different, these two factors may be ecologically linked to deforestation rates.

Turning to the isolation of forest patches, two different factors from the Topography category appeared as the most important factors in the selected models from RF and STEP: the mean altitude of each municipality and the mean slope across each whole municipality, respectively. Although they are different measurements, both factors are related to the relief of the study area, which plays an important role in influencing deforestation in the Atlantic Forest biome (Silva et al. 2007). Also, due to intense exploration in the last 500 years, the Atlantic Forest remnants are currently restricted to the higher elevations and steeper reliefs (Dean 1996; Oliveira-Filho and Fontes 2000; Ribeiro et al. 2009; Kauano et al. 2012). The most important factor for the amount of forest was mean slope in the models selected from both statistical approaches.

The density of forest patches was mostly affected by two similar factors: the density of roads in the selected model from RF and the minimum distance to the nearest road in the selected model from STEP. These findings are consistent, since roads serve as fragmenting features (Forman and Alexander 1998; Butler et al. 2004), subdividing forests, increasing the number of forest patches and reducing forest connectivity. Roads have few positive or neutral environmental impacts but numerous negative effects. Positive impacts include increasing accessibility (Leinbach 1995), which can also be negative since this facilitates deforestation (Laurance et al. 2001). Negative impacts include habitat loss, degradation, fragmentation, direct wildlife mortality and road avoidance behaviours by wildlife (Forman and Alexander 1998). Therefore, density of roads plays an effective role in forest fragmentation and the minimum distance to the nearest road also reflects this role.

Notwithstanding a few similarities between the outcomes of the two modelling approaches, differences between them are strongly evident. However, the reasons for these differences are not clear from our results and require further investigation. Nonetheless, in theory, one would expect the RF approach to identify the factors that have the greatest influence over models more reliably than STEP. This expectation arises from the greater robustness of RF-type methods compared to traditional regression approaches. Unlike traditional regression, which has well-known weaknesses despite still being widely used in ecology (Whittingham et al. 2006), RF methods make no assumptions about the distributions of variables and are robust to outliers in factors. They can also handle situations where the number of factors exceeds the number of observations and have a novel variable importance measure, which does not suffer the shortcomings of traditional variable selection methods, such as selecting only one or two variables among a group of equally good but highly correlated predictors (Cutler et al. 2007). Thus, the greater range of values of explained variance in the RF outcomes compared to the STEP outcomes may be indicative of their greater robustness and ability to distinguish meaningful relationships. Furthermore, many of the studies that have applied classical regression approaches to understand the drivers of forest cover changes (e.g. Jaimes et al. 2010; Gao and Li 2011; Freitas et al. 2013; Gong et al. 2013) may have had to use a restricted number of factors to be able to satisfy the requirements of normality, which could have hindered the analyses, whereas the flexibility and robustness of RF overcome such limitations.
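One way in which a tree-based method can explain more variance than a linear regression is when the response to a factor is not monotonic. The toy sketch below contrasts a straight-line fit with a single, very small regression tree (the building block of RF, not a full forest and not the STEP procedure used in this study) on a hypothetical hump-shaped response. All data, names and the depth limit are assumptions made for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def percent_explained(observed, predicted):
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return 100.0 * (1.0 - mse / (sse(observed) / len(observed)))


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a predict function."""
    x_mean, y_mean = mean(xs), mean(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return lambda x: intercept + slope * x


def fit_tree(xs, ys, depth):
    """Tiny one-factor regression tree; returns a predict function."""
    best = None
    if depth > 0:
        for threshold in sorted(set(xs))[1:]:
            left = [y for x, y in zip(xs, ys) if x < threshold]
            right = [y for x, y in zip(xs, ys) if x >= threshold]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, threshold)
    if best is None or best[0] >= sse(ys):
        leaf_value = mean(ys)  # no useful split: predict the node mean
        return lambda x: leaf_value
    threshold = best[1]
    left_pairs = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right_pairs = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    left_fn = fit_tree([x for x, _ in left_pairs], [y for _, y in left_pairs], depth - 1)
    right_fn = fit_tree([x for x, _ in right_pairs], [y for _, y in right_pairs], depth - 1)
    return lambda x: left_fn(x) if x < threshold else right_fn(x)


# Hypothetical hump-shaped response: high only at intermediate factor values.
factor = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
metric = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

line = fit_line(factor, metric)
tree = fit_tree(factor, metric, depth=2)
linear_pct = percent_explained(metric, [line(x) for x in factor])
tree_pct = percent_explained(metric, [tree(x) for x in factor])
print(round(linear_pct, 2), round(tree_pct, 2))
```

Here the straight line explains essentially none of the variance, while two splits recover the hump exactly. The percentages are computed on the training data, so they overstate what either model would achieve on new observations.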

Despite its advantages, RF has one main limitation. Unlike traditional regression methods, RF methods do not produce relationships between independent factors and metrics that have simple representations (such as linear equations), making ecological interpretation difficult (Cutler et al. 2007). Nevertheless, the R “extendedForest” library has overcome this issue. This package allows us to generate partial plots, which indicate the direction and form of the response to an individual factor. Therefore, we can now convert the RF outcomes into equations for quantitatively predicting changes in DF/FF metrics that might arise from changes in the BGP and SE factors considered here. Additionally, in the growth rate of deforestation (GRD) and patch density (PD) selected models, RF has exploited structure in our high-dimensional dataset that was not “visible” to STEP, providing an apparently clearer picture of these metrics’ relationships to the factors.
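The idea behind a partial plot can be sketched as follows: for each value on a grid, the factor of interest is set to that value in every observation, and the model predictions are averaged. This is a generic illustration and does not reproduce the R package's interface; `toy_model`, the data and the function names are hypothetical.

```python
def partial_dependence(predict, rows, column, grid):
    """Average prediction when one factor is fixed at each grid value in turn."""
    curve = []
    for value in grid:
        # Overwrite the chosen factor in every row; keep the other factors.
        modified = [r[:column] + [value] + r[column + 1:] for r in rows]
        curve.append(sum(predict(r) for r in modified) / len(modified))
    return curve


def toy_model(row):
    # Hypothetical fitted model: non-linear in factor 0, additive in factor 1.
    return row[0] ** 2 + row[1]


# Hypothetical data: each row is [factor_0, factor_1] for one municipality.
rows = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
curve = partial_dependence(toy_model, rows, column=0, grid=[0.0, 1.0, 2.0])
print(curve)  # [3.0, 4.0, 7.0]
```

Plotting `curve` against the grid shows the direction and form of the average response to the factor, here a curved increase. Such curves average over the other factors and can be misleading when factors are strongly correlated.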

5 Conclusion

Understanding spatial relationships between patterns of DF/FF metrics and SE and BGP factors is important for land use management. The main contribution of this study is the testing of a relatively new application of RF for detecting this kind of relationship, its application to a very large dataset and its comparison with a traditional multiple linear regression method. We found that RF performs better than multiple regression at explaining metrics describing forest patch patterns (PD and ENN) and broader landscape structures (GRD and CA). Given the well-established advantages of decision-tree-based methods over those of classical multiple regression (Breiman et al. 1984; Breiman 2001; Prasad et al. 2006; Cutler et al. 2007, 2008; Pitcher et al. 2011; Ellis et al. 2012b; Cutler 2013; Smith et al. 2013), we suggest that these differences are likely to arise because the patch-pattern metrics and broader landscape structures vary in less smooth or monotonic ways (McGarigal et al. 2016), which RF is able to capture but multiple regression is not. Accordingly, we have shown that RF provides a promising methodology for identifying these relationships, and that it has the potential to be an effective tool for providing essential information for aiding land use management decisions, not only in terms of planning, but also for conservation actions, as proposed by Zanella et al. (2012), in cases of high rates of anthropogenic biodiversity loss, as is the case for the Atlantic Forest.

The initial investigation reported in the present study is, however, only a first step in exploiting this method’s potential. One aspect that requires further consideration is the scale of the study area and the very wide variety of SE and BGP contexts that it encompasses. Even in relatively small areas, a multitude of diverse factors are at work (Qasim et al. 2013), and variations in contexts may have influenced model performance in the present study. Landscape pattern is scale-sensitive (Gao and Li 2011) and the unusually large degree of heterogeneity in the Atlantic Forest biome is likely only to exacerbate this issue. Policies need to be crafted at appropriate spatial scales and with specific contexts in mind. Thus, an important development of this initial study of RF application to cases of DF/FF would be to repeat it at different spatial scales, to identify more precisely the SE and BGP factors associated with these processes.