Introduction

In recent decades, biodiversity and ecosystems have come under intense pressure from global changes that are significantly altering species composition and distribution (Dawson et al. 2011). Understanding the spatial distribution of species and the underlying environmental factors is therefore fundamental to a wide range of ecological, evolutionary and conservation applications (Guillera-Arroita 2017; Inman et al. 2021). Consequently, tools that quantify changes in species distributions are of great importance (Rosenberg et al. 2019; Inman et al. 2021).

Species distribution models (SDMs) are quantitative tools and statistical modeling approaches widely used in ecology to map out suitable habitat for species and to assess the potential impact of climate change on their ecological niche (Guisan et al. 2013; Inman et al. 2021). SDMs are widely used to delineate priority areas for effective management and conservation of species (Franklin 2013). Accurate species distribution maps are essential for the implementation of sustainable conservation plans (Jarnevich et al. 2015). This requires using high quality species data collected using appropriate survey methods and structured sampling methods and tools (Osborne and Leitão 2009; Duputié et al. 2014; Moudrý et al. 2017).

Unfortunately, high quality data are very costly and often spatially limited (truncated) (Osborne and Leitão 2009; Duputié et al. 2014; Moudrý et al. 2017; Inman et al. 2021). This precludes their use, resulting in reliance on the widely available presence-only (PO) data that are collected haphazardly and opportunistically (Inman et al. 2021; Suhaimi et al. 2021). Although PO data are widely available, the way they are collected introduces errors and uncertainties that make their use analytically challenging (Graham et al. 2008; Inman et al. 2021; Suhaimi et al. 2021). Consequently, they rarely meet the assumptions of SDMs.

First, SDMs assume that sampling effort is uniform across the landscape and that the species niche is sampled across the full range of environmental conditions in which it occurs (Phillips et al. 2009; Hastie and Fithian 2013). However, PO data are susceptible to sampling bias resulting from unsystematic field surveys, biased data collection from relatively accessible areas, or biased sampling effort (Graham et al. 2004; Hortal et al. 2007; Syfert et al. 2013). Even in areas where occurrence data are collected, individuals of the species may be present but undetected, which introduces the bias of imperfect detection (Yoccoz et al. 2001; Dorazio 2012; Chen et al. 2013).

Second, SDMs assume that the species data encompass the entire realized niche of the species (covering broad environmental gradients) (Elith and Leathwick 2009; Phillips et al. 2009; Chevalier et al. 2021). In many ecological applications, this assumption is violated because study areas are primarily defined by geographic or political boundaries that cover only a subset of a species’ realized niche (Hannemann et al. 2016; El-Gabbas and Dormann 2018). The realized niche is then said to be truncated, and this can significantly degrade SDM predictions (Thuiller et al. 2004; Chevalier et al. 2021). Surprisingly, only a handful of studies have examined the effects of spatial niche truncation (see Pearson et al. 2004; Thuiller et al. 2004; Barbet-Massin et al. 2010; Mateo et al. 2019).

Third, positional measurement inaccuracies, digitization errors, georeferencing problems, and operator error all contribute to high positional uncertainty in PO data (Graham et al. 2004, 2008; Naimi et al. 2011; Rocchini et al. 2011). While some studies have concluded that SDMs are generally insensitive to variations in the level of positional uncertainty (Graham et al. 2008; Fernandez et al. 2009; Mitchell et al. 2017), others have reached the opposite conclusion (Visscher 2006; Johnson and Gillingham 2008; Naimi et al. 2011, 2014). As a result, there is no consensus on its impact on SDM predictions.

Despite widespread criticism of the use of PO data in SDMs, they are still widely used in ecology. This has contributed to the growing use of integrated SDMs (Suhaimi et al. 2021). Integrated SDMs are a recent innovation that combines the spatial point process modeling approach (which includes inhomogeneous Poisson point process models; see Warton and Shepherd (2010) and Renner et al. (2015) for further details) with hierarchical modeling approaches. They incorporate PO data and higher quality data (point count data or site occupancy data) into the same model (Dorazio 2014; Koshkina et al. 2017) to model species distributions, taking advantage of each data source while accounting for their respective limitations (Koshkina et al. 2017). However, another challenge lies in the appropriate use of these modeling approaches (Schank et al. 2019).

To assist ecologists in appropriately modeling species distribution based on the PO data, we must examine the performance of existing modeling approaches under a variety of scenarios of data quality (Suhaimi et al. 2021). Mugumaarhahama et al. (2022) show that in the context of sampling bias and imperfect detection in PO data, the best results are obtained by analysing PO data in conjunction with point count (PC) data using the approach introduced by Dorazio (2014). However, the extent to which the PO data used, which are susceptible to the aforementioned uncertainties, may affect the effectiveness of PO models and the integrated SDM is not yet well understood.

The focus of this study is on the use of PO data in SDMs through spatial point process models. This research aims to assess the impacts of the aforementioned uncertainties in the data on the performance of SDMs. We use a virtual ecologist approach (Zurell et al. 2010; Suhaimi et al. 2021), in which we simulate the distribution of a virtual species and sample it under different conditions of data quality. In our simulations, we consider different scenarios in which a modeler has both PO data of varying quality and PC data that fully or partially cover the range of the virtual species, and must decide whether or not to use the two datasets. Specifically, we assess the marginal and combined effects of both positional uncertainty in PO data and data truncation in PO and/or PC data on the performance of PO and integrated SDMs under conditions of low and high species detectability.

Methods

When assessing how particular factors influence the performance of SDMs, the use of real data could lead to erroneous conclusions (Meynard and Kaplan 2013; Miller 2014; Leroy et al. 2016). In contrast, the use of simulated virtual species has the advantage that the “true” distribution of the species is completely known and the variables that influence this distribution are all known (Hirzel et al. 2001; Zurell et al. 2010; Meynard and Kaplan 2013; Leroy et al. 2016). The main strength of this approach is the ability to compare the model output to a known (virtual) “truth”.

Simulation study

This study assesses the performance of PO models and integrated models under different scenarios of data quality. We explored the performance of these SDMs by examining their ability to estimate key parameters from simulated data whose characteristics are known. The use of virtual species, whose distributions are uniquely determined by a set of simulated environmental factors, ensures that the suitability of all species at each site is strictly determined by these factors, without additional biotic or dispersal restrictions. By simulating the distribution of the virtual species and introducing various biases into the data, and then refitting the models with these data, the resulting parameter estimates can be compared with the initial parameters of the “true” distribution and thus determine the effects of the factors under study (Hirzel et al. 2001; Zurell et al. 2010; Meynard and Kaplan 2013; Leroy et al. 2016). Figure 1 depicts the simulation framework.

Fig. 1 General framework of the simulation process

Generating virtual species range

For the data generation process, the simulation design was similar to that described in Dorazio (2014) and Koshkina et al. (2017). We assumed that individuals of the virtual species reside within a 2D grid B which is assumed to be a square divided into 1000 × 1000 grid cells.

Two environmental covariates, x(s) and w(s), were generated using bivariate distributions that vary spatially and are independent of each other. The bivariate distributions were chosen so that both environmental covariates are defined at every point s of B (see Dorazio (2014) and Koshkina et al. (2017) for a more in-depth description).

Considering that n individuals of the virtual species reside within B, PO data are a set \(s = \{s_1,s_2,s_3, \dots ,s_n\}\) of point locations in B where these individuals are recorded. It is assumed that the activity centers of observed individuals are a realization of a Poisson point process parameterized by a first-order intensity function \(\lambda (s)\) (Dorazio 2014). The process that characterizes the presence-only data is inhomogeneous because the intensity \(\lambda (s)\) varies with location depending on environmental covariates hypothesized to influence or define potential habitat of species (Franklin 2010). In this study, we used a log-linear function that depends on a single covariate x(s) to generate the intensity surface over B, which represents the “true” spatial pattern of the virtual species distribution, as follows:

$$\log (\lambda (s)) = \beta _0 + \beta _1 x(s)$$
(1)

We considered \(\beta _{0} = \log (1000) \approx 6.91\) and \(\beta _{1} = 0.5\). With these arbitrarily chosen values, the lower the values of x(s), the lower the virtual species intensity and vice-versa.
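
As an illustration of Eq. 1, the sketch below evaluates the log-linear intensity and draws a Poisson count for each cell of a small grid. It is a minimal sketch under stated assumptions, not the simulation code used in the study: B is taken to be the unit square, the covariate surface is a hypothetical west-to-east gradient on a 50 × 50 grid, and the helper names (`intensity`, `poisson_draw`, `simulate_counts`) are introduced here for illustration only.

```python
import math
import random

BETA0 = math.log(1000.0)  # intercept given in the text
BETA1 = 0.5               # slope given in the text


def intensity(x):
    """Log-linear intensity of Eq. 1: lambda(s) = exp(beta0 + beta1 * x(s))."""
    return math.exp(BETA0 + BETA1 * x)


def poisson_draw(mean, rng):
    """Knuth's multiplication method; adequate for the small per-cell means used here."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(covariate_grid, rng):
    """Draw one Poisson count per cell with mean lambda(s) * cell area.

    Assumption: B is the unit square, so each cell has area 1 / (rows * cols).
    """
    n_rows, n_cols = len(covariate_grid), len(covariate_grid[0])
    cell_area = 1.0 / (n_rows * n_cols)
    return [[poisson_draw(intensity(x) * cell_area, rng) for x in row]
            for row in covariate_grid]


rng = random.Random(1)
# Hypothetical covariate surface: a smooth gradient from -1 (west) to 1 (east).
grid = [[-1.0 + 2.0 * col / 49 for col in range(50)] for _ in range(50)]
counts = simulate_counts(grid, rng)
total = sum(map(sum, counts))
print("simulated individuals:", total)
```

Because \(\beta _1 > 0\), cells with larger x(s) receive more individuals on average, which is the pattern described above.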

Sampling PO data of the virtual species with different scenarios of uncertainty

To simulate different scenarios of data quality, we introduced errors and uncertainties into the simulated PO data. As species occurrence data are prone to imperfect detectability, which includes sampling bias and imperfect detection, only m of the n individuals are observed. Due to imperfect detectability, PO data are a set \(s = \{s_1,s_2,s_3, \dots ,s_m\}\) of point locations of observed individuals in B (\(m < n\)). The number of observed individuals, m, depends on the thinned intensity \(\lambda (s)b(s)\) (Dorazio 2014; Koshkina et al. 2017). The detectability b(s) was simulated as a function of the covariate w(s) using the following equation:

$${{\,\textrm{logit}\,}}(b(s)) = \alpha _0 + \alpha _1 w(s)$$
(2)

In this study, we arbitrarily set \(\alpha _{1} = -1.5\), with \(\alpha _{0} = -1.1\) for the low detectability scenario and \(\alpha _{0} = 4.6\) for the high detectability scenario. With these arbitrarily chosen values, the mean detectability is \({\bar{b}}(s) \approx 0.33\) for the low detectability scenario and \({\bar{b}}(s) \approx 0.98\) for the high detectability scenario. Note that b(s) includes both sampling bias and imperfect detection. As a result, observed individuals were simulated following the thinned intensity \(\lambda (s)b(s)\) (Dorazio 2014; Koshkina et al. 2017). Figure 2 illustrates the simulated intensity \(\lambda (s)\) and the thinned intensity \(\lambda (s)b(s)\).
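
The thinning step of Eq. 2 can be sketched as follows: each individual is retained independently with probability b(s), which for a Poisson process yields a process with intensity \(\lambda (s)b(s)\). This is a minimal sketch, not the authors' code. The individuals and their covariate values are hypothetical (w is drawn from a standard normal purely for illustration, so the resulting mean detectability need not match the values reported above), and the function names are introduced here.

```python
import math
import random

ALPHA1 = -1.5                        # slope given in the text
ALPHA0_LOW, ALPHA0_HIGH = -1.1, 4.6  # intercepts for the two scenarios


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def detectability(w, alpha0, alpha1=ALPHA1):
    """Eq. 2: logit(b(s)) = alpha0 + alpha1 * w(s)."""
    return expit(alpha0 + alpha1 * w)


def thin(points, alpha0, rng):
    """Keep each individual independently with probability b(s).

    `points` is a list of (easting, northing, w) tuples, where w is the
    covariate value at the individual's location.
    """
    return [p for p in points if rng.random() < detectability(p[2], alpha0)]


rng = random.Random(42)
# Hypothetical individuals; w ~ N(0, 1) is an assumption made for this sketch.
individuals = [(rng.random(), rng.random(), rng.gauss(0.0, 1.0))
               for _ in range(5000)]
observed_low = thin(individuals, ALPHA0_LOW, rng)
observed_high = thin(individuals, ALPHA0_HIGH, rng)
print(len(observed_low), len(observed_high))
```

At w(s) = 0, the two intercepts correspond to detectabilities of about 0.25 and 0.99, respectively.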

Fig. 2 Simulated intensity (\(\lambda (s)\)) and thinned intensity (\(\lambda (s)b(s)\)). The top panel corresponds to the high detectability scenario, and the bottom panel corresponds to the low detectability scenario. The red box represents subregion \(B'\) where truncated data are located

In addition to issues of detectability, PO data often do not cover the full extent of species ranges. Such PO data are said to be spatially truncated because they cover only a subset of the realized niche of the species, delimited by geographic or political boundaries such as country or continental borders (Thuiller et al. 2004; Hannemann et al. 2016; El-Gabbas and Dormann 2018; Mateo et al. 2019). In this study, we arbitrarily considered \(B'\), a rectangular subset of B comprising 320 × 600 grid cells, to simulate the spatial niche truncation of the virtual species (see Fig. 2). To assess the effects of niche truncation on SDMs, two cases were considered for PO data in this study: the case where occurrence data cover the entire B (no truncation), and the case where occurrence data cover only \(B'\) (truncated niche).

Furthermore, one of the uncertainties associated with PO data concerns where the occurrence was actually located. Positional uncertainty in PO data leads to a shift in a point’s position in the longitudinal and latitudinal directions (Heuvelink et al. 2007; Graham et al. 2008; Naimi et al. 2011; Osborne and Leitão 2009; Hefley et al. 2014). Let \(E_i\) and \(N_i\) denote the coordinates (easting and northing) at which individual i was observed and recorded. In this study, we introduced a positional error, \(\varepsilon\), into \(E_i\) and \(N_i\) using a probabilistic approach, as in Hamm et al. (2004) and Naimi et al. (2011). This shifted the sampled species occurrences in random directions. We introduced the positional error, \(\varepsilon\), in PO data as follows:

$$\varepsilon \sim {\mathcal {N}}(0,\vartheta )$$
$$E_i= E_i + k\varepsilon _{E_i}$$
(3)
$$N_i= N_i + k\varepsilon _{N_i}$$
(4)

Here, k is the resolution of \(\lambda (s)\). Taking \(\varepsilon \sim {\mathcal {N}}(0,\vartheta )\) gives a normally distributed unbiased error whose standard deviation \(\vartheta\) defines the positional uncertainty. The lower the \(\vartheta\), the higher the positional accuracy, and vice versa. In this study, three levels of positional uncertainty were introduced by varying the values of \(\vartheta\). The values of \(\vartheta\) were chosen so that the corresponding positional accuracy was high (uncertainty = no pixel shift), medium (uncertainty = shift of 4 pixels), and low (uncertainty = shift of 8 pixels).
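
The perturbation in Eqs. 3 and 4 can be sketched as below. This is an illustrative sketch rather than the study's implementation: the function name `add_positional_error` is introduced here, the resolution k = 1/1000 assumes that B is a unit square with 1000 cells per side, and using 0, 4 and 8 directly as values of \(\vartheta\) (in pixel units) is an assumption, since the text reports only the resulting pixel shifts. Points shifted outside B are not handled.

```python
import random


def add_positional_error(points, sd, resolution, rng):
    """Shift each (easting, northing) by resolution * epsilon, epsilon ~ N(0, sd).

    Independent errors are drawn for the easting and the northing, so the
    displacement direction is random.
    """
    shifted = []
    for easting, northing in points:
        shifted.append((easting + resolution * rng.gauss(0.0, sd),
                        northing + resolution * rng.gauss(0.0, sd)))
    return shifted


# Assumed set-up: unit square, 1000 cells per side, so one pixel = 0.001.
K = 1.0 / 1000
records = [(0.50, 0.50), (0.25, 0.75), (0.80, 0.10)]
for sd in (0.0, 4.0, 8.0):  # hypothetical high, medium and low accuracy levels
    moved = add_positional_error(records, sd, K, random.Random(0))
    print(sd, moved[0])
```

With a standard deviation of zero the coordinates are returned unchanged, which corresponds to the high accuracy scenario.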

Point-count data

To fit the integrated SDMs, additional point count (PC) data are required. Collection of PC data is expensive. Thus, these data are often collected in a small contiguous sub-region of the study area (Koshkina et al. 2017). In this situation, PC data do not cover the full extent of the species’ range and are said to be spatially truncated. In this study, we examine how gathering PC data from \(B'\), a subset of B, could impact the performance of the integrated SDM (PBPC Model). Two situations were considered in the process of simulating PC data: full range data and truncated data. We simulated the PC data by first dividing B into 50 × 50 square quadrats of equal size, each corresponding to 20 × 20 grid cells of B. Usually, the number of PO observations far exceeds the number of sites visited in PC surveys (Dorazio 2014). In this study, the quadrats sampled in the planned surveys represent 4% of the total number of quadrats. Therefore, 100 quadrats were randomly selected from B for the full PC data, while 19 quadrats were randomly selected from \(B'\) for the truncated PC data. For each quadrat, the corresponding intensity (the “true” number of individuals present in it) was taken as the sum of the intensities \(\lambda (s)\) of the grid cells that fall within it.
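
The quadrat arithmetic above can be checked with a short sketch: 1000 × 1000 cells give 50 × 50 = 2500 quadrats, of which 4% is 100, and the 320 × 600 cells of \(B'\) give 16 × 30 = 480 quadrats, of which 4% is 19.2, reported as 19. The code below is illustrative only; the function names are introduced here, and rounding the sample size down is an assumption that happens to reproduce the reported 19.

```python
import random

QUADRAT_SIDE = 20      # each quadrat spans 20 x 20 grid cells
SAMPLING_PERCENT = 4   # share of quadrats visited in planned surveys


def n_quadrats(n_rows, n_cols, side=QUADRAT_SIDE):
    """Number of whole quadrats that fit in an n_rows x n_cols block of cells."""
    return (n_rows // side) * (n_cols // side)


def n_sampled(n_rows, n_cols):
    """Sample size at the stated sampling rate, rounded down (an assumption)."""
    return n_quadrats(n_rows, n_cols) * SAMPLING_PERCENT // 100


def quadrat_totals(cell_values, side=QUADRAT_SIDE):
    """Sum per-cell values over non-overlapping side x side quadrats."""
    n_rows, n_cols = len(cell_values), len(cell_values[0])
    totals = [[0.0] * (n_cols // side) for _ in range(n_rows // side)]
    for r in range(n_rows - n_rows % side):
        for c in range(n_cols - n_cols % side):
            totals[r // side][c // side] += cell_values[r][c]
    return totals


rng = random.Random(3)
# Indices of the quadrats surveyed in the full-range scenario.
sampled_full = rng.sample(range(n_quadrats(1000, 1000)), n_sampled(1000, 1000))
print(n_sampled(1000, 1000), n_sampled(320, 600), len(sampled_full))
```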

PC data are not prone to sampling bias because they are collected following structured sampling methods. However, imperfect detection can occur even during planned surveys. Hence, in contrast with PO data, detectability (detection probability) in PC data includes only imperfect detection. In this study, we simulated each quadrat being visited in J repeated surveys, and as in Dorazio (2014) and Koshkina et al. (2017), we assumed that the detection probabilities p(s) and b(s) at any site s are influenced by the same covariate w(s). For repeated planned surveys, the detection probability was simulated depending only on the single covariate w(s) as follows:

$${{\,\textrm{logit}\,}}(p_j(s)) = \gamma _0 + \gamma _1 w(s)$$
(5)

For the J repeated planned surveys, the detectability was assumed to be the same in every survey. In other words, for any site s in B, \(p_1(s) = p_2(s) = \cdots = p_J(s)\). In this study, we arbitrarily considered \(\gamma _0 = 2.5\) and \(\gamma _1 = -1.0\).

Simulated PC data were obtained by conducting \(J = 4\) independent binomial draws from individuals of each quadrat. In other words, each simulated set of PC was computed by aggregating the realized locations of individuals in the study area into quadrats, by selecting a random sample of these quadrats, and by taking \(J = 4\) independent binomial draws from the individuals present in each sampled quadrat (see Dorazio 2014).
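
The repeated-survey step can be sketched as follows, using Eq. 5 with the stated values of \(\gamma _0\) and \(\gamma _1\) and \(J = 4\). This is a minimal sketch under assumptions: the number of individuals present in the quadrat and the quadrat-level value of w(s) are hypothetical inputs (how w(s) is summarized per quadrat is not stated in this section), and the function names are introduced here.

```python
import math
import random

GAMMA0, GAMMA1 = 2.5, -1.0  # values given in the text
J = 4                       # number of repeated surveys


def detection_probability(w):
    """Eq. 5: logit(p(s)) = gamma0 + gamma1 * w(s), identical in every survey."""
    return 1.0 / (1.0 + math.exp(-(GAMMA0 + GAMMA1 * w)))


def binomial_draw(n, p, rng):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def survey_quadrat(n_present, w, rng, n_surveys=J):
    """Counts from n_surveys independent visits to one quadrat."""
    p = detection_probability(w)
    return [binomial_draw(n_present, p, rng) for _ in range(n_surveys)]


rng = random.Random(7)
# Hypothetical quadrat holding 25 individuals, with covariate value w = 0.3.
print(survey_quadrat(25, 0.3, rng))
```

Each of the four counts is at most the number of individuals present, and the counts differ between visits only through detection error.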

With all simulated data used to fit integrated SDMs, three cases of niche truncation were obtained: (i) no truncation: all data (PO and PC data) cover the whole B, (ii) partial truncation: PO data cover the whole B while PC data cover only \(B'\), (iii) full truncation: all data cover only \(B'\).

Data analysis

Three SDMs were tested in this work:

  1. PO Model: The spatial point process model that analyzes PO data while ignoring the effect of b(s). This model was fitted using the simulated PO data sets only (see Warton and Shepherd 2010);

  2. THINPO Model: The spatial point process model that analyzes PO data as a thinned point process. This model accounts for b(s) based on PO data only (see Dorazio 2014);

  3. PBPC Model: The integrated SDM that accounts for b(s) by analyzing PO data in conjunction with PC data. This model was fitted using the simulated PO data sets and PC data sets (see Dorazio 2014).
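
For orientation, the log-likelihoods behind these three models can be written in their generic textbook form. The expressions below are a sketch under stated assumptions rather than the exact parameterization used in the study, which is not shown in this section: \(y_{qj}\) denotes the count in sampled quadrat q during survey j, \(n_q\) the latent number of individuals in that quadrat, \(p_q\) its detection probability and \(\Lambda _q\) the integrated intensity over the quadrat. These symbols are introduced here for illustration, and the PO and PC data sets are assumed to be independent.

$$\ell _{\text {PO}}(\beta ) = \sum _{i=1}^{m} \log \lambda (s_i) - \int _B \lambda (s)\,ds$$

$$\ell _{\text {THINPO}}(\beta ,\alpha ) = \sum _{i=1}^{m} \log \big (\lambda (s_i)b(s_i)\big ) - \int _B \lambda (s)b(s)\,ds$$

$$\ell _{\text {PC}}(\beta ,\gamma ) = \sum _{q} \log \sum _{n_q \ge \max _j y_{qj}} \text {Poisson}(n_q;\Lambda _q) \prod _{j=1}^{J} \text {Binomial}(y_{qj};n_q,p_q), \qquad \Lambda _q = \int _{q} \lambda (s)\,ds$$

$$\ell _{\text {PBPC}}(\beta ,\alpha ,\gamma ) = \ell _{\text {THINPO}}(\beta ,\alpha ) + \ell _{\text {PC}}(\beta ,\gamma )$$

Written this way, the PO Model treats the observed points as if they followed \(\lambda (s)\) itself, whereas the THINPO and PBPC Models separate \(\lambda (s)\) from b(s).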

In the first stage, we tested the ability of these SDMs to estimate \(\beta _0\) and \(\beta _1\) parameters that determine the intensity \(\lambda ({\textbf{s}})\) in Eq. 1. The estimates of fitted models were compared to the “true” values used in the simulation process.

In all experiments, a total of 500 data sets containing PO observations and PC data were simulated, with the SDMs then fitted to each realization of the data. The parameters \(\beta _0\) and \(\beta _1\) were estimated from the likelihood of the SDMs using the BFGS (Broyden–Fletcher–Goldfarb–Shanno) optimization algorithm implemented in the optim function in R software (version 4.0.5) (Schank et al. 2019). Sometimes, the optim function failed to return an optimized set of parameters. If estimated parameters were returned from this function, we determined whether they were identifiable using the reciprocal of the condition number, which is the ratio of the smallest to the largest eigenvalue of the Fisher information matrix (Dorazio 2014). The parameters of the species distribution models were considered identifiable if the reciprocal of the condition number had a value greater than \(10^{-6}\). Indeed, values of the reciprocal of the condition number close to 0 indicate poor conditioning (poor optimization), while values close to 1 indicate good conditioning (good optimization) (Golub and Van Loan 2013; Schank et al. 2019). Only estimates from models with identifiable parameters were considered for further analysis.
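
The identifiability check can be illustrated with a small sketch that computes the reciprocal condition number of a symmetric 2 × 2 matrix in closed form and compares it with the \(10^{-6}\) threshold. This is an illustration under assumptions: it is restricted to the two-parameter case (models with more parameters need a general eigenvalue routine), it assumes a positive definite matrix, the example matrices are made up, and the function names are introduced here.

```python
import math

THRESHOLD = 1e-6  # identifiability cut-off stated in the text


def eigenvalues_sym2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius


def reciprocal_condition_number(a, b, d):
    """Ratio of the smallest to the largest eigenvalue (assumes positive definite)."""
    smallest, largest = eigenvalues_sym2(a, b, d)
    return smallest / largest


def is_identifiable(a, b, d, threshold=THRESHOLD):
    """True when the reciprocal condition number exceeds the threshold."""
    return reciprocal_condition_number(a, b, d) > threshold


# Made-up information matrices: one well conditioned, one nearly singular.
print(reciprocal_condition_number(4.0, 1.0, 3.0), is_identifiable(4.0, 1.0, 3.0))
print(reciprocal_condition_number(1.0, 0.0, 1e-8), is_identifiable(1.0, 0.0, 1e-8))
```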

Performance assessment

Model evaluation is an essential step in model selection and determining the accuracy of the prediction. In general, model precision is measured primarily via evaluation and agreement metrics (Liu et al. 2011; Soultan and Safi 2017). In this study, the performance of each SDM was assessed at two levels: the ability of models to produce accurate operating characteristics of maximum likelihood estimates of \(\beta _k\) (namely \({\hat{\beta }}_k, k = 0,1\)), and their ability to predict accurately the species distribution \(\lambda (s)\).

Measuring performance in estimating \(\beta _k\)

The measures used to assess the performance of SDMs in estimating \(\beta _k\) are presented in Table 1.

Table 1 Measures used to assess the performance of SDMs in estimating \(\beta _k\)

For \({\hat{\beta }}_k\), the relative bias (%Bias) was calculated for each replication while the standard deviation of \({\hat{\beta }}_k\) and the root mean squared error (RMSE) of \({\hat{\beta }}_k\) were calculated over N = 500 replications (runs) of the simulation process.
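
Since Table 1 is not reproduced here, the sketch below uses the conventional definitions of these three measures, which is an assumption: relative bias as \(100 \times ({\hat{\beta }}_k - \beta _k)/\beta _k\), the sample standard deviation of the estimates, and the RMSE around the “true” value. The function names and the toy estimates are introduced for illustration.

```python
import math


def relative_bias_percent(estimate, truth):
    """Relative bias of a single estimate, in percent of the true value."""
    return 100.0 * (estimate - truth) / truth


def standard_deviation(estimates):
    """Sample standard deviation of the estimates across replications."""
    mean = sum(estimates) / len(estimates)
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1))


def rmse(estimates, truth):
    """Root mean squared error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


# Toy example: three made-up estimates of beta_1, whose true value is 0.5.
estimates = [0.4, 0.5, 0.6]
print([relative_bias_percent(e, 0.5) for e in estimates])
print(standard_deviation(estimates), rmse(estimates, 0.5))
```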

Measuring performance in predicting \(\lambda (s)\)

The root mean squared error (RMSE) and two agreement metrics, namely Schoener’s D index and the overall concordance correlation coefficient (OCCC), were used to assess, respectively, the statistical performance (accuracy) of SDMs in predicting \(\lambda (s)\) and the reliability of their predictions. The RMSE measures the accuracy of \({\hat{\lambda }}(s)\), whereas the agreement metrics assess the spatial agreement between the “true” and the predicted ranges to determine prediction reliability. In other words, reliability can be used to determine the distance between predicted ranges and the “true” ranges (Soultan and Safi 2017). In this study, we determined the degree of agreement between “true” and predicted ranges by calculating the overlap of their geographical niches. We determined Schoener’s D index using the “nicheOverlap” function from the “dismo” R package. The niche overlap value ranges from 0 to 1, where 0 denotes no overlap and 1 denotes complete overlap (Warren et al. 2008; Soultan and Safi 2017). In addition, we measured the absolute agreement between the “true” and modeled ranges via a pixel-by-pixel comparison using the OCCC, a measure of agreement between two continuous datasets generated using two distinct methodologies (Warren et al. 2008). The OCCC was calculated using the “epiR” R package. The OCCC value ranges from 0 to 1, with 0 indicating 100% disagreement and 1 indicating 100% agreement between the “true” and predicted ranges. These metrics were computed for each replication (run). For each run, M = 1000 random points were selected over B and used to extract the “true” values \(\lambda (s_1),\lambda (s_2),\lambda (s_3)\dots \lambda (s_{1000})\) and the corresponding predicted values \({\hat{\lambda }}(s_1),{\hat{\lambda }}(s_2),{\hat{\lambda }}(s_3)\dots {\hat{\lambda }}(s_{1000})\). The extracted values of \(\lambda (s)\) and \({\hat{\lambda }}(s)\) were then used to calculate the RMSE, Schoener’s D and the OCCC.
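
To make the difference between relative and absolute agreement concrete, the sketch below computes the three metrics for a toy pair of surfaces. It is illustrative only and does not call the R packages named above: Schoener’s D follows the usual definition on surfaces rescaled to sum to 1, and Lin’s concordance correlation coefficient is used as a two-dataset stand-in for the OCCC, which is an assumption. The function names and the toy values are introduced here.

```python
import math


def prediction_rmse(true_vals, pred_vals):
    """Root mean squared difference between true and predicted intensities."""
    n = len(true_vals)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_vals, pred_vals)) / n)


def schoener_d(true_vals, pred_vals):
    """D = 1 - 0.5 * sum |p_i - q_i| after scaling each surface to sum to 1."""
    total_true, total_pred = sum(true_vals), sum(pred_vals)
    return 1.0 - 0.5 * sum(abs(t / total_true - p / total_pred)
                           for t, p in zip(true_vals, pred_vals))


def concordance_correlation(true_vals, pred_vals):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(true_vals)
    mean_t = sum(true_vals) / n
    mean_p = sum(pred_vals) / n
    var_t = sum((t - mean_t) ** 2 for t in true_vals) / n
    var_p = sum((p - mean_p) ** 2 for p in pred_vals) / n
    cov = sum((t - mean_t) * (p - mean_p)
              for t, p in zip(true_vals, pred_vals)) / n
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


# Toy surfaces: the prediction has the right spatial pattern but is scaled
# down by a constant factor, as if detectability had been ignored.
true_lambda = [800.0, 1000.0, 1200.0, 1500.0]
pred_lambda = [0.3 * v for v in true_lambda]
print(prediction_rmse(true_lambda, pred_lambda))
print(schoener_d(true_lambda, pred_lambda))
print(concordance_correlation(true_lambda, pred_lambda))
```

In this toy case Schoener’s D stays at 1 because the rescaling removes the constant factor, while the RMSE is large and the concordance coefficient is close to 0. A relative overlap index can therefore look favorable even when absolute intensities are badly underestimated.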

Results

Estimates \({\hat{\beta }}_0\) and \({\hat{\beta }}_1\) obtained with different SDMs using PO data prone to different sources of uncertainty

Figure 3 shows the 95% confidence ellipses for the intensity coefficients (\(\beta _0\) and \(\beta _1\)) obtained by fitting different SDMs under different types of uncertainty in data (PO and PC data). The plot illustrates the precision and accuracy with which the coefficients are estimated by each SDM. To highlight the marginal effects of each type of uncertainty, the confidence ellipses are determined using data that are not prone to the other two types of uncertainty. First, the results in Fig. 3A show that failure to account for imperfect detectability can lead to highly biased coefficient estimates (\({\hat{\beta }}_0\) and \({\hat{\beta }}_1\)), altering the estimated geographic distribution of species (\({\hat{\lambda }}(s)\)). The \({\hat{\beta }}_0\) coefficient is the most affected by imperfect detectability. In that case, we are effectively estimating the presence-only intensity (\(\lambda (s)b(s)\)) instead of the species intensity (\(\lambda (s)\)) (see Fig. 3A). On the other hand, alternatives that attempt to account for imperfect detectability (THINPO and PBPC) give results that are reasonably close to reality. In fact, the real values of the \(\beta _0\) and \(\beta _1\) coefficients fall within the confidence ellipses resulting from the THINPO and PBPC models, regardless of the detectability scenario. However, it is worth noting the widening of the confidence ellipses in low detectability situations. Second, Fig. 3B shows a weak effect of positional uncertainty on the \({\hat{\beta }}_0\) and \({\hat{\beta }}_1\) coefficients for all three studied SDMs. The confidence ellipses contain the real values of the \(\beta _0\) and \(\beta _1\) coefficients despite the increase in positional uncertainty. With the variation of this factor, the accuracy and precision of \({\hat{\beta }}_0\) and \({\hat{\beta }}_1\) vary only slightly. However, with higher levels of imprecision than those considered in this study, it is not guaranteed that the effects will remain as small. Finally, the same trend is observed for the effects of spatial niche truncation. For this source of uncertainty, the obtained confidence ellipses also contain the points that represent the real values of the \(\beta _0\) and \(\beta _1\) coefficients, regardless of the SDM (see Fig. 3C). However, it is worth noting the widening of the confidence ellipses in the situation of full truncation (none of the data cover the full range of the species). This mainly reflects a loss of precision in \({\hat{\beta }}_1\). The results presented in the rest of this section illustrate the performance of the SDMs under different combinations of these three factors.

Fig. 3 95% confidence ellipses for \({\hat{\beta }}_0\) and \({\hat{\beta }}_1\) obtained by fitting SDMs under varying detectability (A), positional accuracy (B), and niche truncation (C). The red plus denotes the “true” values of the parameters of interest (\(\beta _0 \approx 6.91\) and \(\beta _1 = 0.5\)).

Effects of uncertainties on maximum likelihood estimates of \(\beta _0\) and \(\beta _1\)

In situations of low detectability, the \({\hat{\beta }}_0\) obtained with the PO Model are strongly biased. Using PO data only, the THINPO Model improves \({\hat{\beta }}_0\): it alleviates the bias, but with high variance, which reflects low precision. To obtain much better estimates (with low bias and high precision), integrated SDMs are the best alternative. The use of PC data in conjunction with PO data through the PBPC Model improved \({\hat{\beta }}_0\) over the THINPO Model by increasing their precision. Regarding niche truncation, the precision of \({\hat{\beta }}_0\) decreases slightly in the situation of full spatial niche truncation. Partial truncation becomes challenging for \({\hat{\beta }}_0\) when, in addition, the PO data are subject to positional imprecision (low and medium positional accuracy). This behavior is observed for all detectability scenarios, except that the higher the detectability, the higher the precision of \({\hat{\beta }}_0\). In the situation of high detectability, the PO Model outperformed the others by giving unbiased and the most precise \({\hat{\beta }}_0\), whatever the positional uncertainty or the spatial niche truncation (see Fig. 4 and Table 2).

Fig. 4 Relative bias in maximum likelihood estimates of \(\beta _0\) obtained by fitting PO SDMs (A) and Integrated SDM (B). The dashed red line indicates the ideal situation where the bias in \(\beta _0\) is equal to 0

Table 2 Standard deviation (and RMSE) of maximum likelihood estimates of \(\beta _0\) and \(\beta _1\) under varying positional uncertainty in the situation of low and high detectability

Overall, for all SDMs, \({\hat{\beta }}_1\) are relatively unbiased whatever the detectability scenario and niche truncation. The only exception is the PO Model under low detectability when PO data are truncated. However, the precision of \({\hat{\beta }}_1\) is impaired by the decrease in detectability and by spatial niche truncation. For this parameter, the lower the detectability, the lower the precision. With partial or full niche truncation, the precision of \({\hat{\beta }}_1\) is lower. As for \({\hat{\beta }}_0\), partial niche truncation becomes challenging for \({\hat{\beta }}_1\) when PO data are prone to positional imprecision (see Fig. 5 and Table 2).

Fig. 5 Relative bias in maximum likelihood estimates of \(\beta _1\) obtained by fitting PO SDMs (A) and Integrated SDM (B). The dashed red line indicates the ideal situation where the bias in \(\beta _1\) is equal to 0

Effects of uncertainties on the accuracy and reliability of SDMs’ predictions (\({\hat{\lambda }}(s)\))

All the effects of the uncertainties under study on \({\hat{\beta }}_0\) and \({\hat{\beta }}_1\) are expected to affect the estimates of the species distribution (intensity), and thus the \({\hat{\lambda }}(s)\) obtained from the SDMs used. In this study, we assessed the statistical performance of SDMs through the RMSE, which measures the accuracy of \({\hat{\lambda }}(s)\). In addition, we measured the reliability of \({\hat{\lambda }}(s)\) through Schoener’s D index and the OCCC, which measure the spatial agreement between the “true” and predicted species distributions. Schoener’s D measures the relative agreement, while the OCCC measures the absolute agreement. Accordingly, we consider that a model performs well when its RMSE is low (near zero) and its Schoener’s D and OCCC are high (near 1). The results obtained are summarized in Figs. 6, 7 and 8.

Fig. 6 Accuracy of \({\hat{\lambda }}(s)\) as shown by the RMSE obtained with different SDMs fitted using different scenarios of uncertainty in data: PO SDMs (A) and Integrated SDM (B). The dashed red line indicates the median of \({{\,\textrm{RMSE}\,}}\). In this figure, the higher the RMSE, the lower the accuracy and vice versa

Fig. 7 The spatial niche overlap between the predicted species distribution (\({\hat{\lambda }}(s)\)) and the “true” species distribution (\(\lambda (s)\)) according to the Schoener’s D index obtained with different SDMs fitted using different scenarios of uncertainty in data: PO SDMs (A) and Integrated SDM (B). The dashed red line indicates the median of Schoener’s D index

Fig. 8 The spatial agreement between the predicted species distribution (\({\hat{\lambda }}(s)\)) and the “true” species distribution (\(\lambda (s)\)) according to the OCCC obtained with different SDMs fitted using different scenarios of uncertainty in data: PO SDMs (A) and Integrated SDM (B). The dashed red line indicates the median of OCCC

Imperfect detection leads to a significant loss of accuracy (significant increase of RMSE) in species distribution estimates (\({\hat{\lambda }}(s)\)) when not accounted for (see Fig. 6A). The THINPO Model and PBPC Model were able to estimate the distribution of species (\({\hat{\lambda }}(s)\)) with higher accuracy (lower RMSE) than the PO Model. The integrated SDM (PBPC Model) is less sensitive to imperfect detectability, and positional uncertainty has no considerable effect on its accuracy (it induces less variation in RMSE). Furthermore, spatial niche truncation does not induce a significant loss of accuracy in \({\hat{\lambda }}(s)\), except when it occurs together with positional uncertainty in PO data. In this situation, the PBPC model shows a non-negligible loss of accuracy, as expressed by RMSE values (see Fig. 6B). For the THINPO model, the effects of spatial truncation are not as negligible as for the PBPC model. Moreover, the behavior of this model becomes erratic in situations of near perfect detectability (see Fig. 6A). It should be noted that when there is no problem of imperfect detectability in the PO data, the PO model outperforms the other SDMs. In fact, under these circumstances, this model gives the best accuracy (low RMSE) (see Fig. 6).

The results of Schoener’s D index show that none of the factors studied (imperfect detectability, positional uncertainty and spatial niche truncation) induce significant effects on the spatial niche overlap of the predictions (\({\hat{\lambda }}(s)\)) with the real distribution of the species of interest (\(\lambda (s)\)), except in the case of low detectability and spatial truncation for the PO model. It should be noted that the Schoener’s D index values obtained seem to understate the effects of the studied uncertainties in data on the reliability of SDMs. Surprisingly, the results of Schoener’s D index show that, despite the effects observed in the previous results, the predictions obtained remain reliable.

On the other hand, the OCCC results are more or less in line with those of the RMSE. When not accounted for, imperfect detection leads to a significant loss of spatial agreement between \({\hat{\lambda }}(s)\) and \(\lambda (s)\) (see Fig. 8). The integrated SDM (PBPC Model) is less sensitive to imperfect detectability, and positional uncertainty has no considerable effect on the spatial agreement between \(\lambda (s)\) and the \({\hat{\lambda }}(s)\) it produces. Furthermore, spatial niche truncation does not induce a significant loss of spatial agreement, except when it occurs together with positional uncertainty in the PO data, as shown in Fig. 6B. As for the RMSE results, it should be noted that when detectability is not a problem in the PO data, the PO model outperforms the other SDMs, even under spatial niche truncation. In fact, under these circumstances, this model gives an OCCC almost equal to 1.
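The three comparison metrics reported above can be sketched as follows. This is a minimal illustration and not the code used in this study: the function names and the toy intensity surfaces are assumptions, and the OCCC is computed here as Lin's concordance correlation coefficient, to which it reduces when only two surfaces are compared.

```python
import numpy as np

def rmse(lam_true, lam_hat):
    """Root mean squared error between true and predicted intensities."""
    return float(np.sqrt(np.mean((lam_hat - lam_true) ** 2)))

def schoener_d(lam_true, lam_hat):
    """Schoener's D: 1 minus half the summed absolute difference
    between the two surfaces after each is normalised to sum to 1."""
    p = lam_true / lam_true.sum()
    q = lam_hat / lam_hat.sum()
    return float(1.0 - 0.5 * np.abs(p - q).sum())

def concordance(lam_true, lam_hat):
    """Lin's concordance correlation coefficient (two-surface case)."""
    m_true, m_hat = lam_true.mean(), lam_hat.mean()
    cov = np.mean((lam_true - m_true) * (lam_hat - m_hat))
    return float(2.0 * cov / (lam_true.var() + lam_hat.var() + (m_true - m_hat) ** 2))

# Toy surfaces (hypothetical): the prediction has the right shape
# but underestimates the overall level of the intensity.
x = np.linspace(-1.0, 1.0, 200)
lam_true = np.exp(1.0 + 2.0 * x)
lam_scaled = 0.3 * lam_true

err = rmse(lam_true, lam_scaled)
overlap = schoener_d(lam_true, lam_scaled)
agreement = concordance(lam_true, lam_scaled)
print(err, overlap, agreement)
```

In this toy case Schoener's D stays at 1 while RMSE and the concordance coefficient both register the discrepancy, because D compares normalised surfaces and is therefore insensitive to a uniform under-estimation of the intensity. This may be one reason, not tested here, why D appears less sensitive than the other two metrics.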

Discussion

SDMs rely on a number of assumptions to guarantee their performance and the reliability of their predictions. However, PO data that meet these assumptions are rarely available, which raises doubts about the reliability of the conclusions drawn. This study illustrates the effect of multi-source uncertainties in species PO data on SDM performance. The aim is to assess the marginal and combined effects of these uncertainties on the ability of SDMs to estimate the parameters of the species distribution model (intensity, \(\lambda (s)\)) and on the predictive performance of these models.

In this study, the poor performance of the PO model is evidence that imperfect detectability leads to a serious loss of predictive performance in SDMs, and hence to erroneous conclusions about species ranges. SDM predictions obtained without accounting for imperfect detectability do not estimate \(\lambda (s)\), the “true” species distribution; instead, they reflect \(\lambda (s)b(s)\), the species distribution confounded with sampling effort (Phillips et al. 2009; Fithian et al. 2015). In this context, it is challenging to distinguish between predictions that accurately reflect the ecological processes influencing the spatial distribution of a species and those that are linked to detectability or sampling effects (Dorazio 2012; Fithian et al. 2015; Guillera-Arroita 2017). Therefore, our findings emphasize the importance of accounting for imperfect detectability in SDMs. They corroborate the findings of other works that stress the risk of ignoring this type of bias in species occurrence data. Like our study, Phillips et al. (2009), Yackulic et al. (2013) and Guillera-Arroita (2017) showed that ignoring imperfect detectability has consequences for the reliability of SDM predictions. It leads to erroneous conclusions regarding the distribution of species, erroneous inferences regarding the determinants of species distribution, incorrect quantifications of biodiversity, and incorrect conclusions regarding environmental change (Guillera-Arroita 2017). However, our findings are not in line with some studies that recommend ignoring the effects of imperfect detectability. Indeed, opinions differ regarding the effects of imperfect detectability on SDM performance (Guélat and Kéry 2018). Some studies concluded that these effects are negligible and recommended ignoring them (e.g., Banks-Leite et al. 2014; Johnson and Gillingham 2008; Stephens et al. 2015). 
We believe that this difference of opinion may be explained by the fact that the effects of imperfect detectability vary with the eco-geographic characteristics of the species. For example, for generalist species, although the data are prone to geographic sampling bias, they may be sufficiently representative of the environmental conditions across the full species range to allow SDMs to capture the conditions favorable to the species. This is not necessarily true for specialized species. The effects of imperfect detectability would also vary with the variance and importance of the underlying factors. For example, Thibaud et al. (2014), Fithian et al. (2015) and Fletcher et al. (2016), as well as more recent studies such as Chevalier et al. (2021), suggested using covariates such as distance to roads or distance to cities as predictors of imperfect detectability. If distance to roads and/or distance to large cities are the main factors underlying imperfect detectability, and if the road network is sufficiently developed and the cities sufficiently numerous and scattered to cover a large part of the environmental conditions of the species, then the effects of imperfect detectability may be reduced enough to induce only minor losses in model performance.
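The confounding of \(\lambda (s)\) with \(b(s)\) discussed above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the simulation design of this study: the grid of cells, the parameter values, the log-linear form of the thinning function, and the helper `fit_poisson` are all hypothetical, and the detectability covariate `z` (for example, a proxy for distance to roads) is deliberately correlated with the environmental covariate `x`.

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_poisson(X, y, n_iter=25):
    """Poisson log-linear regression fitted by iteratively
    reweighted least squares (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # stable starting intercept
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        working = eta + (y - mu) / mu   # working response
        XtW = X.T * mu                  # weights equal the fitted means
        beta = np.linalg.solve(XtW @ X, XtW @ working)
    return beta

n_cells = 5000
x = rng.uniform(-1, 1, n_cells)                  # environmental covariate
z = 0.6 * x + 0.8 * rng.uniform(-1, 1, n_cells)  # detectability covariate

beta0, beta1 = 2.0, 1.0      # assumed parameters of lambda(s)
alpha0, alpha1 = -1.0, 0.5   # assumed parameters of b(s)

lam = np.exp(beta0 + beta1 * x)      # true intensity lambda(s)
b = np.exp(alpha0 + alpha1 * z)      # thinning b(s), below 1 on this grid
counts = rng.poisson(lam * b)        # observed presence-only counts

ones = np.ones(n_cells)
naive = fit_poisson(np.column_stack([ones, x]), counts)       # ignores b(s)
thinned = fit_poisson(np.column_stack([ones, x, z]), counts)  # models b(s)

print("naive slope for x:  ", naive[1])
print("thinned slope for x:", thinned[1])
```

In this setup the naive fit absorbs part of the detectability effect into the slope of `x`, whereas including `z` recovers a slope close to the assumed `beta1`. The intercept of the thinned fit estimates only the sum `beta0 + alpha0`, which is consistent with the argument that PO data alone are not informative about the overall level of detectability.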

Findings of this study show that in the context of low detectability all studied SDMs produce unbiased estimates of \(\beta _0\) and \(\beta _1\), but they differ mainly in the precision of these estimates. The best accuracy is obtained with the PBPC Model. However, in the context of high detectability, the PO Model outperformed the SDMs that account for imperfect detectability (THINPO and PBPC Models). It is therefore important to select the model carefully, taking into account the characteristics of the data. To account for imperfect detectability in PO data, several authors have proposed modeling species distribution as a thinned Poisson point process (THINPO Model) (Chakraborty et al. 2011; Fithian and Hastie 2013; Hefley et al. 2013; Warton et al. 2013; Dorazio 2014; Fithian et al. 2015). The results of this study show that, depending on data characteristics, this may not be enough, and that additional data may be necessary in some contexts. Our results are in accordance with those of authors who argue that it is impossible to accurately estimate species distribution using PO data alone, because PO data are not informative about species detectability. Additional data that are informative about species detectability are required to improve the estimation of the parameters of species distribution models (Fithian et al. 2015; Dorazio 2014; Koshkina et al. 2017). Although the use of these additional data is crucial, it is also necessary to make a reasoned choice of the covariates to be included in the model, especially the detectability variables. This choice is not trivial and must be made as thoroughly as possible (Koshkina et al. 2017). A poor choice of these covariates could also have negative effects on the performance of the SDMs. However, our results are silent on this issue. What would be the bias in the parameter estimates if the key determinants of detectability were omitted? Nolan et al. (2022) proposed an approach to identify sources of sampling bias in PO data through the use of zero-inflated models. We therefore recommend listing all potential drivers of sampling bias and then using the approach proposed by Nolan et al. (2022) to identify those that should be retained in the THINPO Model or the integrated SDM.

In addition, the assumption that the data are representative of the environmental conditions across the full species range is often violated because, in most SDM studies, PO data are collected in areas defined by geographical or political borders (e.g., national monitoring programs) that encompass only a subset of a species’ realized niche (Hannemann et al. 2016; El-Gabbas and Dormann 2018; Chevalier et al. 2021). These data are therefore not necessarily representative of the species range (Chevalier et al. 2021). Surprisingly, the results obtained in this study show little effect of spatial niche truncation on the maximum likelihood estimates of \(\beta _0\) and \(\beta _1\), and thus no strong effect on the predicted species distribution (\({\hat{\lambda }}(s)\)), especially for the integrated SDM. Indeed, in this study, \({\hat{\beta }}_0\) and \({\hat{\beta }}_1\) of the integrated SDM (PBPC model) are unbiased regardless of the spatial niche truncation scenario. Furthermore, spatial niche truncation has only small effects on the precision of \({\hat{\beta }}_0\) and \({\hat{\beta }}_1\), which do not significantly affect the predictions (\({\hat{\lambda }}(s)\)) of this model, except when PO data are positionally imprecise in addition to being truncated. Consequently, this study suggests that the use of spatially constrained data has little effect on the results of SDMs. Since high quality data are often spatially constrained, there is little reason to fear that this will affect the results of integrated models. In terms of transferability, if the data used to calibrate the models are sufficiently representative of the full range of the species, the results of SDMs can be generalized to other geographic areas and time periods without concern. This is not consistent with a number of previous studies that have found evidence of severe effects of spatial niche truncation on the quality of SDM outputs. 
Indeed, these studies stated that if species occurrence data fail to capture the full realized niche of the species, they cannot adequately characterize the environmental conditions tolerated by the species. Thus, it may not be possible to obtain reliable outputs from models built using such data (Pearson and Dawson 2003; Thuiller et al. 2004; Titeux et al. 2017). Consequently, it is difficult to estimate the function linking species distributions and environmental variables, leading to inaccurate predictions of species distributions (Chevalier et al. 2021; Thuiller et al. 2004), which can result in wasting resources on ineffective and expensive restoration plans or losing populations of conservation concern (Guisan et al. 2013; Araújo et al. 2019; Chevalier et al. 2021). We suspect that the discrepancy between our results and those of previous research may be explained by the fact that the ecological conditions of the area chosen for our spatial niche truncation simulations may have been sufficiently representative of the environmental tolerance of the virtual species. Elith et al. (2010) recommend testing the similarity between the restricted data used for model calibration and the full range data (extrapolation or projection data) by multivariate environmental similarity surface analysis. If this analysis shows dissimilarities, one should pay attention to spatial niche truncation effects (Barbet-Massin et al. 2010; Bálint et al. 2011; Edman et al. 2011; Keenan et al. 2011; Bertrand et al. 2012; Raes 2012; Chevalier et al. 2021). One way to address this issue is to consider data from larger spatial and (especially) ecological scales (Hannemann et al. 2016; Chevalier et al. 2021). It is then preferable to use the full range of species and environmental data to calibrate SDMs rather than considering only a subset of them (Araújo and Guisan 2006). 
However, this alternative may not be satisfactory if SDMs are to be fitted at fine spatial scales using local predictors such as land cover and fine environmental details such as local microclimate (Pearson et al. 2004; Zellweger et al. 2019; Chevalier et al. 2021). Furthermore, we suggest that the effects of spatial niche truncation are likely to vary with respect to the eco-geographic characteristics of the species. Spatial niche truncation may have stronger effects on specialized (narrowly distributed) species than on generalist species. Since our virtual species is not highly specialized, we expect it to be less sensitive to spatial niche truncation.
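The multivariate environmental similarity surface check mentioned above can be sketched as follows. This is a minimal illustration, assuming the usual per-variable similarity rule attributed to Elith et al. (2010); the function name `mess` and the toy reference and projection arrays are hypothetical and do not come from this study.

```python
import numpy as np

def mess(reference, projection):
    """Multivariate environmental similarity score per projection point.

    reference:  (n_ref, k) covariate values in the calibration region
    projection: (n_proj, k) covariate values where the model is projected
    Negative scores flag points outside the calibration range for at
    least one covariate.
    """
    sims = np.empty(projection.shape, dtype=float)
    for j in range(reference.shape[1]):
        ref = reference[:, j]
        p = projection[:, j]
        lo, hi = ref.min(), ref.max()
        # Percentage of reference values strictly below each projected value
        f = 100.0 * (ref[None, :] < p[:, None]).mean(axis=1)
        sims[:, j] = np.where(
            f == 0, (p - lo) / (hi - lo) * 100.0,
            np.where(f <= 50, 2.0 * f,
                     np.where(f < 100, 2.0 * (100.0 - f),
                              (hi - p) / (hi - lo) * 100.0)))
    # The overall score is the similarity of the most dissimilar covariate
    return sims.min(axis=1)

# Toy calibration region: two covariates on regular grids (hypothetical)
reference = np.column_stack([np.linspace(0, 10, 101), np.linspace(0, 1, 101)])
# Three projected points: inside the range, above it, and below it
projection = np.array([[5.0, 0.5], [12.0, 0.5], [-1.0, 0.5]])

scores = mess(reference, projection)
print(scores)
```

Points with negative scores lie outside the environmental range covered by the calibration data, which is where spatial niche truncation effects would be of most concern.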

In addition to the aforementioned uncertainties, the positional uncertainty of PO data is a further concern (Naimi et al. 2011; Rocchini et al. 2011; Soultan and Safi 2017). The use of such uncertainty-prone PO data is hypothesized to result in inaccurate predictions of species distributions, and thus to misguide biodiversity management and conservation efforts. The results of this study indicate that the effects of positional uncertainty on the maximum likelihood estimates of the \(\beta _0\) and \(\beta _1\) coefficients are not as severe as one might expect. These results are consistent with previous research indicating that the effect of positional uncertainty of species occurrences on the performance of SDMs is relatively small (Graham et al. 2008; Osborne and Leitão 2009; Soultan and Safi 2017; Hayes et al. 2015; Fernandez et al. 2009; Mitchell et al. 2017). However, other studies contradict our findings. They found that positional uncertainty can lead to inaccurate predictions of species distributions (Visscher 2006; Johnson and Gillingham 2008; Naimi et al. 2011, 2014). We suspect that this disagreement may be related to the fact that the effects of positional uncertainty vary with species characteristics, which in turn shape responses to environmental covariates (Soultan and Safi 2017). Indeed, Soultan and Safi (2017) found that species specialization affects the sensitivity of SDMs to positional uncertainty in species occurrence data. For generalist species, positional uncertainty has a relatively small effect, whereas specialist species are more sensitive to it. The sensitivity of specialist species may be due to an increased probability of assigning imprecise species occurrences to inappropriate areas, whereas this probability is inherently reduced for generalist species. Species characteristics are also likely to influence the effectiveness of SDMs. The results of this study may therefore not be applicable to all species. 
Future research should therefore assess how species characteristics affect the performance of SDMs, and how the effects of uncertainty in PO data vary with those characteristics, to determine whether these SDMs continue to exhibit the same performance.
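The argument that specialist species are more exposed to positional uncertainty can be illustrated with a toy calculation. This is a minimal sketch under strong assumptions that are not part of this study: a one-dimensional environmental gradient, a Gaussian suitability curve, occurrences drawn in proportion to suitability, and Gaussian positional error. The function names and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def suitability(s, optimum, width):
    """Gaussian response of a virtual species along a 1-D gradient."""
    return np.exp(-((s - optimum) ** 2) / (2.0 * width ** 2))

def mean_shift(width, error_sd, n=20000, optimum=50.0):
    """Mean absolute change in suitability between the true and the
    recorded location when records carry Gaussian positional error."""
    s_true = rng.normal(optimum, width, n)           # occurrences follow suitability
    s_obs = s_true + rng.normal(0.0, error_sd, n)    # recorded, imprecise positions
    diff = suitability(s_obs, optimum, width) - suitability(s_true, optimum, width)
    return float(np.mean(np.abs(diff)))

# Same positional error, narrow versus broad niche (assumed widths)
specialist = mean_shift(width=5.0, error_sd=5.0)
generalist = mean_shift(width=30.0, error_sd=5.0)
print(specialist, generalist)
```

With identical positional error, the records of the narrow-niche species are displaced into markedly less suitable conditions than those of the broad-niche species, which is the mechanism suggested above for the greater sensitivity of specialists.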

Conclusion

In this study, we used simulated data to investigate the effects of positional uncertainty and spatial niche truncation on the performance of species distribution models (SDMs) under low and high species detectability. We show that SDMs that account for imperfect detectability (THINPO or PBPC models) offer no advantage in high detectability situations. In this situation, the PO model produces the most accurate maximum likelihood estimates of \(\beta _0\) and \(\beta _1\), and consequently the most accurate predictions of species distributions (\({\hat{\lambda }}(s)\)). The effects of positional uncertainty and spatial niche truncation on the output of this SDM are minimal. However, in situations of low detectability, it is preferable to analyze PO data alongside PC data in an integrated SDM. Our results indicate that positional uncertainty and spatial niche truncation have negligible effects on the output of this SDM, except when positionally uncertain PO data are analyzed alongside truncated PC data. However, the effects of positional uncertainty and spatial niche truncation may vary with species characteristics, and they can have a significant impact on the outputs of SDMs for specialized species. Multivariate environmental similarity surface analysis is proposed to test the similarity between data from the restricted region to be used for model calibration and data from the entire range. If this analysis reveals dissimilarities, spatial niche truncation effects should be considered. One way to address this issue is to use data from larger spatial and ecological scales. It is therefore preferable to use the full range of species and environmental data to calibrate SDMs, rather than just a subset of them. Assessing how species characteristics affect model performance, and how the effects of data uncertainties vary with those characteristics, could be the subject of future research.