Background

Spatial autocorrelation

Working with spatial or geographical data requires an understanding of the properties of such data. Every discipline that uses geographic data, whether geography, ecology, or any related field in which space and time are involved, is concerned with how such data are characterized. One of the most common issues with spatial data is the existence of structure or dependence among the observations. Environmental and biological processes are often related spatially or temporally. This dependence reflects the notion of distance decay, wherein the degree of dependence decreases with increasing distance. This is the basis of Tobler’s (1970) first law of geography: everything is related to everything else, but nearby things are more related than distant things. These ideas fall under the concept of spatial autocorrelation (SAC). This term, which was introduced around the late 1960s and early 1970s (Getis 2008), is loosely defined as follows:

The property of random variables taking values, at pairs of locations a certain distance apart, that are more similar (positive autocorrelation) or less similar (negative autocorrelation) than expected for randomly associated pairs of observations (Legendre 1993, p. 1659).
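
The definition above is commonly quantified with Moran’s I, the statistic mentioned later in this review for residuals. The following Python sketch is illustrative only: the six-site transect, the binary neighbor weights, and the function names are assumptions made for the example and are not taken from any reviewed study. It contrasts a smooth gradient (similar neighbors) with an alternating pattern (dissimilar neighbors).

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a dict {(i, j): w_ij} of spatial weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Weighted cross-products of deviations between linked sites
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    total_w = sum(weights.values())
    ss = sum(d * d for d in dev)
    return (n / total_w) * (cross / ss)


def transect_weights(n):
    """Binary weights linking each site to its immediate neighbors on a line (hypothetical layout)."""
    weights = {}
    for i in range(n - 1):
        weights[(i, i + 1)] = 1.0
        weights[(i + 1, i)] = 1.0
    return weights


w = transect_weights(6)
gradient = [1, 2, 3, 4, 5, 6]          # neighbors take similar values
alternating = [1, -1, 1, -1, 1, -1]    # neighbors take opposite values

print(morans_i(gradient, w))      # positive autocorrelation
print(morans_i(alternating, w))   # negative autocorrelation
```

Under these assumptions the gradient yields approximately 0.6 and the alternating pattern approximately -1, matching the positive and negative cases in the definition.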

Depending on the factors that drive natural processes, SAC is categorized into two major types: exogenous and endogenous SAC (Legendre 1993). The former is caused by external environmental (physico-chemical, climatological, geomorphological) factors such as temperature, soil, and terrain attributes (Dormann 2007a; Kissling and Carl 2008; Miller 2012; Václavík et al. 2012). It is generally associated with broad-scale spatial trends (Miller et al. 2007; Václavík et al. 2012). Endogenous SAC is induced by biological (or biology-related) processes (geographic dispersal, predation, disturbance, inter-specific interactions, colonial breeding, home-range size, host availability, parasitization risk, metapopulation dynamics, history) that are inherent to the species data (Dormann 2007a; Kissling and Carl 2008; Miller 2012; Crase et al. 2014). It reflects contagion effects in cases of positive autocorrelation or dispersion effects for negative autocorrelation (Lichstein et al. 2002; Griffith and Peres-Neto 2006; Crase et al. 2014). Such endogenous SAC is relevant at fine scales or to high-resolution stochastic biotic processes (Dormann 2007a; Miller et al. 2007; Chun and Griffith 2011; Václavík et al. 2012; Kim 2018).

Residual spatial autocorrelation

In the modeling context, residuals represent the differences between observed and predicted values. Hence, residual spatial autocorrelation (rSAC) is the SAC that remains in the portion of the variance not explained by the explanatory variables. Understanding the distribution of residuals is key to regression modeling, as assumptions such as linearity, normality, homoscedasticity (equal variance), and independence rely on the behavior of the errors.

Whether rSAC is incorporated or ignored directly affects the outcomes of species distribution modeling (SDM). Failing to address rSAC appropriately will likely lead to three major statistical problems. First, the standard errors might be underestimated, which inflates the Type I error rate. That is, when observations assumed to be independent are in fact spatially dependent, a true null hypothesis is falsely rejected much more often than expected (Lennon 2000). Consequently, the regression model itself becomes unreliable (Legendre 1993; Anselin 2002; Kim et al. 2016). Second, parameter estimates, such as the regression coefficients and F-statistic, might be biased (Dormann 2007a; Václavík et al. 2012). The inflation or deflation of predictors’ coefficients will induce the over- or under-estimation of their predictive power, respectively. Finally, model misspecification, related to variable selection, remains an important issue (Austin 2002; Lichstein et al. 2002; Miller et al. 2007; Václavík et al. 2012). The presence of SAC in model residuals is typical of spatial ecological data (Borcard et al. 1992; Lennon 2000; Dormann 2007a; Kissling and Carl 2008; Bini et al. 2009; Kim and Shin 2016); therefore, using these types of data usually violates the assumption of independence between pairs of observations, necessitating that the effects of rSAC be accounted for (Diniz-Filho and Bini 2005; Bahn et al. 2006).
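
The first problem, underestimated standard errors, can be illustrated with a small calculation. The sketch below is not drawn from the reviewed studies: it assumes a hypothetical transect of n sites whose errors have correlation rho^|i - j| (an exponentially decaying, AR(1)-like structure chosen only for simplicity) and compares the true variance of a sample mean with the naive value obtained under independence.

```python
def variance_inflation(n, rho):
    """Ratio of the true variance of a sample mean to the naive sigma^2 / n,
    assuming correlation rho ** |i - j| between sites i and j (illustrative assumption)."""
    total = sum(rho ** abs(i - j) for i in range(n) for j in range(n))
    return total / n


n = 100
for rho in (0.0, 0.5):
    ratio = variance_inflation(n, rho)
    # n / ratio is the number of independent observations carrying the same information
    print(rho, round(ratio, 2), round(n / ratio, 1))
```

With rho = 0.5 the variance is roughly three times the naive value, so 100 correlated sites carry about as much information as 34 independent ones. A test that treats them as 100 independent observations will therefore reject a true null hypothesis too often.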

Species distribution modeling

The views of previous species distribution modeling studies are mixed in regard to certain effects of SAC on the outcomes of spatial predictive models. In some articles (e.g., Lennon 2000; Dormann 2007a; Kim et al. 2016), the three statistical consequences briefly mentioned in the preceding section are well recognized. For example, Lennon (2000) urged ecologists to start integrating SAC in their modeling. Convinced of the ill effects of failing to incorporate SAC in ecological data modeling, he took a strong stance suggesting that such effects can invalidate previous works that used standard non-spatial models (e.g., ordinary least squares; OLS). In other research (Dormann 2007a; Kim et al. 2016), the voice was moderate. That is, despite the fact that spatially explicit models generally outperform their non-spatial counterparts (i.e., greater R2 or lower rSAC), the final conclusions were rather tentative. In his review, Dormann (2007a) estimated, on average, a positive coefficient shift in favor of a spatial model as high as 25% and concluded that in certain methodological conditions, such models showed an edge over non-spatial models. Subsequent to Dormann’s (2007a) review, two studies (Kim 2013; Kim et al. 2016) consistently witnessed a better performance of spatially explicit models over non-spatial ones. However, it was concluded that whether that superiority holds true for any spatial methods, sampling strategies or field designs remains to be seen. It was suspected that whether data were collected randomly, on a grid, in a nested or stratified fashion, or how densely the samples are distributed might make a difference in the modeling outcomes. As compelling and relevant as SAC appears to be, only a minority of published studies in the ecological field—for example, less than 20% (Dormann 2007a) or 3 out of 44 (Crase et al. 2014)—working with spatial data have addressed the issue.

On the other hand, other studies (e.g., Diniz-Filho et al. 2003; Hawkins et al. 2007; Bini et al. 2009; Miller 2012) disagree with the abovementioned claims. In these studies, the question of which parameter estimates in non-spatial modeling (models that do not account for SAC) are biased was not treated as a critical issue. For example, Hawkins et al. (2007) warned against claiming the superiority of spatial models and the falseness of non-spatial ones, as they found no significant differences between global OLS models and spatial models, especially when using gridded data. For them, the assumption that non-spatial models are automatically flawed in comparison with spatial models, as argued by Lennon (2000), was a mistake. Moreover, changes in coefficients between spatial and non-spatial models were mainly idiosyncratic and depended on the type of method used (Bini et al. 2009), which suggests that modelers should be explicit and cautious in their claims. These conclusions had already been drawn in previous studies in which non-spatial regression models were found to be unbiased. These studies also recommended that the scale factor be considered when interpreting results (Hawkins et al. 2007). From this perspective, claiming that models that ignore SAC are flawed is groundless (Diniz-Filho et al. 2003). In addition, mathematical analyses show that neither coefficients of spatial models nor those of non-spatial models are totally unbiased; in fact, precision decreases for spatial model coefficients as SAC increases (Miller 2012).

Justification

A substantial number of studies in biogeography and macroecology have broadly covered the topic of SAC, but little is known about how deeply those works have discussed the case of rSAC. Previous studies suspect that failing to include certain explanatory variables might be at the heart of the problem (Crase et al. 2014). This problem, when related to the endogenous rather than the external type of SAC, remains unexplored. An effort to identify potential missing variables and establish how much their omission increases the level of rSAC would potentially bring new knowledge and contribute to the SDM literature body. Along with environmental and biotic missing predictors, the type of sampling design will also be scrutinized. Sampling design is often mentioned as having the ability to potentially cause rSAC to increase (Lichstein et al. 2002; Bini et al. 2009; Crase et al. 2014). This present paper addresses sampling design in terms of sample size, data type, sampling technique, and the effect of small scales in particular. Analyzing data at very fine scales coupled with the inclusion of important spatially autocorrelated missing variables is believed to have the potential to significantly reduce or even remove rSAC in species distribution models. Assuming that environmental factors behave differently at distinct spatial scales, Diniz-Filho et al. (2003) suggest that the inclusion of relevant environmental factors acting at each scale in a regression model would eventually remove SAC from the residuals at different scales.

Our goal in this review article is to evaluate an umbrella research question: Under what circumstances can the magnitude of rSAC increase? This question is broken down into the following three sub-questions:

  1. What are the causes of rSAC?
  2. How much do missing variables account for rSAC?
  3. How do different sampling designs influence the level of SAC in model residuals?

Completing this investigation is expected to accomplish the following: (1) establish the full picture of rSAC in the existing literature of macroecological and biogeographical modeling and (2) serve as a foundation to conduct further research on rSAC.

Article search, selection, and categorization

In this review, we initially targeted articles from macroecology and biogeography that dealt with SDM in which SAC was explicitly incorporated. We used keywords such as residual spatial autocorrelation, spatial autocorrelation, ecological, or biogeographical, as well as species distribution modeling, to search for relevant articles via the Web of Science and Google Scholar engines. We also selected additional articles quoted and referred to by some of these original selections. Thus, some of the studies reviewed in this paper were not exactly from the macroecology and biogeography fields. The subjects of these additional articles belonged to the disciplines of hydrology, soil science, and geomorphology, but they still covered important aspects of SAC in terms of methods, functions, history, and modeling.

As a result, we have chosen a total of 97 articles dating from 1984 to 2017 (Table 1). These articles were carefully reviewed and then categorized based on the level of detail they discussed on rSAC. In the end, we attempted to understand the conditions under which SAC occurs—and magnifies—in model residuals.

Table 1 Literature review in macroecological and biogeographical modeling. SAC, spatial autocorrelation; rSAC, residual spatial autocorrelation

In terms of approach, the articles reviewed were all unique with respect to SAC modeling in geographical ecology. However, SDM remained as the most studied topic across the board (61% of the articles), followed by habitat suitability modeling (22%) and methods (16%). The remaining proportion discussed other aspects of SAC modeling. The modeling included many species, such as birds, plants, mammals, and reptiles. Here are some proxies used as dependent variables: richness, occurrence, abundance, presence and absence, occupancy, composition, dispersal, diversity, and density. For habitat suitability, some surrogates were niche suitability, habitat distribution, climatic suitability, climatic forecast, or predictability.

Potential sources of residual SAC in SDM

Our review of the existing literature revealed that accounting for SAC in SDM still has a long way to go, even though studies over the last three decades have increasingly strived to incorporate the effect of spatial dependence in investigating ecological and biogeographical processes. We found that only a small proportion (less than 20%) of ecological and biogeographical modelers incorporated SAC in their research. This is due partly to the fact that modelers do not yet unanimously agree on the need to incorporate SAC (Diniz-Filho et al. 2003; Hawkins et al. 2007; Bini et al. 2009; Miller 2012). The presence of SAC in ecological and biogeographical data has long been identified (as far back as the late 1970s), and statistical methods to address it were developed almost in the same period (Dormann 2007a). For example, Legendre (1993) defined and categorized the concept of SAC into endogenous and exogenous SAC in the field of ecological data modeling. However, modelers started publishing studies that integrate SAC in substantial numbers only after 2000. This is consistent with the fact that 92 of the 97 articles we reviewed were published in the new millennium. Earlier studies that acknowledged the effect of SAC prior to 2000 include, but are not limited to, Borcard et al. (1992), who partitioned the total variance of species abundance into spatial and non-spatial components, and Pickup and Chewings (1986), who investigated the prediction of erosion and deposition in alluvial landscapes of central Australia.

These observations help explain why rSAC, as a subcategory of SAC, remains relatively unexplored in ecological and biogeographical modeling. We categorized the articles into three groups (i.e., no mention, simple mention, and elaborate) based on the level of detail at which rSAC is discussed (Table 2). Of the 97 articles, 35 (36%) never mentioned the presence or influence of rSAC. The remaining 62 articles (simple mention plus elaborate) mentioned rSAC in some form. Of these 62, 11 (the simple mention category) only referred to the term residual spatial autocorrelation once or twice in their introductory sections, whereas 51 (the elaborate category, representing 53% of all articles) provided more detailed and descriptive information about rSAC. Such details included the definition of rSAC, its origin, methods and suggestions on how to address it, and its quantification using Moran’s I (Table 1). The fact remains, however, that the information provided by the 62 articles is still insufficient for estimating which factors possibly induced the occurrence of rSAC during modeling procedures. Below, we discuss five possible mechanisms or factors that potentially drive rSAC in ecological and biogeographical modeling.

Table 2 Summary of the reviewed articles with regard to the level of detail they provided on residual spatial autocorrelation

Ecological data and processes

Conceptually speaking, SAC is likely to exist in any spatial data because observations from close locations are generally more related than would be expected on a random basis (Kissling and Carl 2008). The interactions between responses within the zone of spatial influence of these locations result from, for example, contagious biotic processes, such as dispersal, growth, mortality, spatial diffusion, diseases, reproduction, and predation (Borcard et al. 1992; Lichstein et al. 2002). These processes can eventually create spatial patterns in species data without the influence of other external environmental data (Borcard et al. 1992). Furthermore, Kim (2013) mentioned the increase in size or a reduction of vegetation as another contagious biological process that can explain the presence of fine-scale intrinsic SAC in spatial environmental data (e.g., soil moisture). Another reason why SAC occurs in ecological data is the diffusive property across space in the movement of environmental and biotic processes, whether it be on the surface of the Earth or below the ground (Kim et al. 2016). Such environmental factors distributed continuously across geographical space explain why, for example, species composition is similar among neighboring locations, as most species generally occupy ranges that are greater than the cell size under study (Diniz-Filho et al. 2003). Consequently, Diniz-Filho et al. (2003) noted that using coarse scales to explain species richness would indubitably deemphasize variations at very fine scales. They suggested the use of diffusive ecological processes that act at small scales to capture information on species composition. Subsequent studies (e.g., Václavík and Meentemeyer 2009) sought to capture small-scale contagious processes leading to spatially dependent distributions and thereby violating the assumption of equilibrium between species and environmental controls (Václavík et al. 2012). These studies used multiple degrees of spatial dependence to investigate the effect of dynamic contagious processes in empirical data. Therefore, any field in which such data are analyzed inherently has to address the issue of SAC induced by contagious processes. In this context, spatial dependencies will probably show up in models using ecological data and processes (Kissling and Carl 2008; Bini et al. 2009; Crase et al. 2014). Models using spatial data are not merely susceptible to having spatially autocorrelated residuals, as Revermann et al. (2012) noted; working with grid data in particular almost guarantees that SAC patterns will be observed in the errors (de Oliveira et al. 2012). In some cases, this is labeled a mismatch between a process unit and an observational unit.

Scale and distance

In fact, several studies have reiterated that rSAC is closely related to distance. For Bini et al. (2009), rSAC was stronger at smaller distances in most empirical data sets. Some researchers have used terms similar to scale and distance presenting the circumstances in which model residuals show spatial dependencies. Lichstein et al. (2002) mentioned first proximity or distance and then defined the concept of appropriate neighborhood size. For these authors, distance among samples was a necessary condition for the presence of rSAC in regression models. Such patterns occurred within an “appropriate neighborhood size,” or the maximum distance at which model residuals are autocorrelated. Therefore, when spatial data are analyzed, an inappropriate spatial resolution will often generate rSAC (Dormann 2007a). It is clear that more works acknowledge the type of scale as a determinant factor for rSAC. Crase et al. (2014) suggested that most of the SAC occurred at small scales (less than 1 km). It is worth pointing out that failing to account for small-scale environmental factors (Diniz-Filho et al. 2003) or only accounting for broad-scale spatial structures (Diniz-Filho and Bini 2005) will result in positive rSAC in species richness modeling at small scales. Thus, all these local-scale spatial structures (Wu and Zhang 2013) accumulated and caused spatial dependencies in the residuals of, for example, bird richness modeling (Bahn et al. 2006). Bahn et al. (2006) conceded that rSAC disappeared when using environmental predictors at large scales (> 100 km). They also admitted that the omission of important community-scale processes constituted another crucial factor of spatial dependence.
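
The dependence of rSAC on distance is often examined with a correlogram, that is, Moran’s I computed separately for pairs of sites a given distance apart. The Python sketch below is a minimal illustration under stated assumptions: the 24-site transect, the smooth sinusoidal "residuals", and the function name are hypothetical and serve only to show how autocorrelation can be tracked across distance lags.

```python
import math


def morans_i_at_lag(values, lag):
    """Moran's I on a transect, linking only pairs of sites exactly `lag` steps apart."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    pairs = n - lag  # number of site pairs separated by `lag`
    cross = sum(dev[i] * dev[i + lag] for i in range(pairs))
    ss = sum(d * d for d in dev)
    # With symmetric binary weights each pair is counted twice in both the
    # numerator and the weight sum, so the factor of two cancels.
    return (n / pairs) * (cross / ss)


# Hypothetical residuals with one smooth, broad spatial pattern along the transect
residuals = [math.sin(2 * math.pi * i / 24) for i in range(24)]

correlogram = {lag: morans_i_at_lag(residuals, lag) for lag in (1, 3, 6, 9, 12)}
for lag, value in correlogram.items():
    print(lag, round(value, 2))
```

In this sketch, Moran’s I is highest at the shortest lag and declines steadily with distance, turning negative at the longest lags. The lag at which the statistic first approaches zero gives a rough reading of the maximum distance at which residuals remain positively autocorrelated, in the spirit of the “appropriate neighborhood size” described above.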

Missing variables

Variable selection is one of the characteristics used to compare traditional non-spatial models with spatial models that explicitly account for the presence of SAC. One explanation for the differences between non-spatial and spatial approaches in selecting variables is that non-spatial models tend to recover the missing spatial information by adding environmental variables that happen to be spatially autocorrelated (Bahn et al. 2006). In fact, failing to select relevant localized, spatially autocorrelated variables is one of the primary causes, if not the first, of rSAC. Leaving out important spatially autocorrelated predictors can directly lead to model misspecification (Bini et al. 2009; Miller 2012), which potentially generates rSAC and creates an instability associated with Lennon’s (2000) “red shift” problem (Bini et al. 2009). As supported by Bini et al. (2009), whenever such unmodeled spatially dependent predictor variables are included in the model, the degree of rSAC decreases. In contrast, when SAC is accounted for, as in the case of a spatially explicit model, the relative importance of spatially autocorrelated independent variables likely decreases. Certain predictors affect the response of ecological and biogeographical processes only at local scales. Conducting broad-scale modeling will undermine such localized response variables, thus resulting in the creation of rSAC (Diniz-Filho et al. 2003). Studies suggest that failing to include important variables also causes positive rSAC, which may serve as an indicator of model misspecification (Lichstein et al. 2002; Diniz-Filho et al. 2008; Kissling and Carl 2008; Bini et al. 2009). Residual SAC is a sub-type of either exogenous or endogenous SAC. Therefore, residuals may also be autocorrelated whenever one of these two types of SAC exists in the data, as corroborated by Diniz-Filho and Bini (2005), Miller et al. (2007), Václavík et al. (2012), and Crase et al. (2014).
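
The effect of an omitted, spatially autocorrelated predictor can be sketched with a small synthetic example. Everything in the Python code below is an assumption made for illustration: the 24-site transect, the variables x, z, and noise, the coefficients, and the helper functions are hypothetical and do not reproduce any analysis from the cited studies. The response depends on a predictor without positive SAC (x) and on a smooth, spatially structured driver (z); an ordinary least squares model is fitted once without z and once with it.

```python
import math


def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols_residuals(columns, y):
    """Fit y on an intercept plus the given predictor columns; return (coefficients, residuals)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return beta, [yi - fi for yi, fi in zip(y, fitted)]


def morans_i(values):
    """Moran's I with binary weights between immediate neighbors on a transect."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    return (n / (n - 1)) * cross / sum(d * d for d in dev)


n = 24
x = [(-1.0) ** i for i in range(n)]                          # predictor without positive SAC
z = [math.sin(2 * math.pi * i / n) for i in range(n)]        # smooth, spatially autocorrelated driver
noise = [0.3 * math.cos(math.pi * i / 2) for i in range(n)]  # fixed term with no lag-1 autocorrelation
y = [1 + 2 * xi + 3 * zi + ei for xi, zi, ei in zip(x, z, noise)]

beta_missing, res_missing = ols_residuals([x], y)   # z omitted
beta_full, res_full = ols_residuals([x, z], y)      # z included
i_missing = morans_i(res_missing)
i_full = morans_i(res_full)
print(round(i_missing, 2), round(i_full, 2))
```

In this constructed case, the residuals of the model that omits z inherit its smooth pattern and Moran’s I is close to 1, whereas including z brings the statistic close to zero. The predictors were deliberately made orthogonal so that the coefficients themselves are unaffected by the omission; with correlated predictors, the omission would typically also shift the estimated coefficients.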

Sampling design

Under this “sampling design” group, we consider sample size, measurement, founder effects, sampling scheme, and sampling intensity. Previous studies suggest that each of these factors can lead to residual spatial dependencies. Bini et al. (2009) observed that a high level of rSAC is usually present in data sets with many observations. On the other hand, Lichstein et al. (2002) suggested that autocorrelated residuals can well be caused by poorly measuring an important autocorrelated predictor. In species assemblage data such as species richness and proportion of endemic species, to name a few, sampling effects are called “artifacts” in the sense that they arise not from the environment but from the researcher (Dormann 2007a; Crase et al. 2014). According to these authors, such artifacts are difficult to correct, and they eventually show up as rSAC. The artifacts are caused by species-specific bias or different recorder density. As an example, taxonomists may split plant species into more “species” than common botanists would, or a data recording team may sample one area more intensively than another, creating a bias unrelated to the environment. Additionally, a sampling scheme would generate rSAC when regions of known occurrence are sampled with higher intensity than regions of unclear occurrence. Finally, ecological interactions between species (e.g., competitive exclusion and founder effects) in isolated habitat patches, such as fragmented landscapes and lakes, will add SAC to assemblage data that is absent from individual species distribution data (Dormann 2007a; Crase et al. 2014).

Assumptions and methodological approaches

Falsely assuming linearity between two factors, using a wrong variable selection method, and ignoring the presence of non-stationarity in a data set can each lead to model residuals being spatially autocorrelated. As Bini et al. (2009) noted, for example, fitting a linear model to a quadratic distribution or response would result in the residuals being spatially autocorrelated. Moreover, performing model selection requires modelers to go through several important steps, including variable selection, for which different approaches are used. Le Rest et al. (2014) found that the Akaike information criterion, when used as a metric to select variables in the presence of rSAC, tended to pick up unnecessary variables to the detriment of important predictors, thereby ignoring the presence of structure in such residuals. Bini et al. (2009) defined non-stationarity as the non-consistency in the relationship between variables throughout the whole extent of the data. Non-stationarity is less intuitive and less used compared to SAC and has only lately been incorporated in SDM (Miller 2012). The concept can be viewed as the spatial variant of a constraint in correlation and regression modeling known as Simpson’s paradox (the linear trend of a sub-group is the reverse of that of the overall group). It is the statistical formalization of spatial heterogeneity, which defines uneven distribution across space (like SAC, it is generally caused by sampling differences, another process in different locations of the study area, or model misspecification such as missing variables). Bini et al. (2009) observed that high rSAC is usually present in data sets with high levels of non-stationarity. Similarly, Lichstein et al. (2002) argued that misspecifying a model form, such as assuming linearity when the relationship is nonlinear, may lead to spatially autocorrelated residuals. According to Wu and Zhang (2013), rSAC will probably result from linearity oversimplification. In sum, all these authors agree that residual dependencies may result from the assumptions that one makes and the methodological approach that one chooses.
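
The link between a misspecified functional form and rSAC can be illustrated with a small synthetic example. The Python sketch below rests on assumptions made only for this illustration: a hypothetical 21-site transect along an environmental gradient, a unimodal (quadratic) response, a fixed perturbation standing in for unstructured noise, and helper functions written for the example. The same data are fitted with a straight line and with a quadratic term, and Moran’s I is computed on each set of residuals.

```python
def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols_residuals(columns, y):
    """Fit y on an intercept plus the given predictor columns; return (coefficients, residuals)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return beta, [yi - fi for yi, fi in zip(y, fitted)]


def morans_i(values):
    """Moran's I with binary weights between immediate neighbors on a transect."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    return (n / (n - 1)) * cross / sum(d * d for d in dev)


n = 21
x = [float(i) for i in range(n)]                          # environmental gradient along the transect
pattern = [1.0, 0.0, -1.0, 0.0]
noise = [2.0 * pattern[i % 4] for i in range(n)]          # fixed term with no lag-1 autocorrelation
y = [(xi - 10.0) ** 2 + ei for xi, ei in zip(x, noise)]   # unimodal (quadratic) response

_, res_linear = ols_residuals([x], y)                           # linearity falsely assumed
_, res_quadratic = ols_residuals([x, [xi ** 2 for xi in x]], y)  # correct functional form
i_linear = morans_i(res_linear)
i_quadratic = morans_i(res_quadratic)
print(round(i_linear, 2), round(i_quadratic, 2))
```

Under these assumptions the linear fit leaves the curved part of the response in the residuals, which are strongly positively autocorrelated (Moran’s I of roughly 0.8), while the quadratic fit leaves residuals with Moran’s I near zero. The example only shows that the mechanism is plausible; it says nothing about how large the effect is in real data sets.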

Conclusions

In macroecological and biogeographical modeling, multiple facets of SAC have been extensively investigated. Incorporating SAC in the modeling process, comparing spatial and non-spatial modeling, and identifying the potential consequences arising from the presence of spatial structure are relatively well addressed in previous studies. There seems to be a consensus that spatially explicit models in most cases outperform non-spatial models that ignore the effects of spatial dependence. However, the reasons why such differences in model performance exist and the circumstances under which they magnify have yet to be investigated (Crase et al. 2014; Kim et al. 2016; Miralha and Kim 2018). Most importantly, it is agreed that modeling outcomes and inferences are most affected when model residuals are spatially autocorrelated. Therefore, there has been a sense of urgency and a need to investigate rSAC in more detailed and explicit ways.

Our review of the major studies covering the topic of SAC allowed us to identify the potential sources of rSAC. In fact, a thorough review of the works reveals that the nature of the data, missing autocorrelated variables, scale, sampling design, and false methodological assumptions constitute the primary causes of SAC in model residuals. In addition to the causes of SAC, it turned out that SDM and habitat suitability modeling in birds, plants, mammals, and reptiles along with methods are the most studied topics. Despite being somewhat subjective, this categorization is an important finding, considering that it provides a better understanding of the circumstances under which model residuals are spatially autocorrelated.

The lack of quantifiable data, however, prevented us from assessing the extent to which rSAC is a real issue in SDM. In our review, most of the papers that mention rSAC (64% of those reviewed, including the elaborate and simple mention categories; Table 2) do so only briefly and contain no quantitative information that would allow any estimation. This review shows that rSAC in macroecological and biogeographical models is mainly intrinsic, as inherent biotic processes drive the presence of spatial structure in the errors. Thus, it suggests a need for future investigations to aim at quantifying rSAC and analyzing its augmentation patterns. It is worth examining the role of missing variables, diverse sampling designs, and types of data, along with model misspecification, in inducing the presence of SAC in model residuals. Therefore, using combinations of such factors at multiple scales to model macroecological and biogeographical processes is strongly recommended.