Provisioning forest and conservation science with high-resolution maps of potential distribution of major European tree species under climate change

We developed a dataset of the potential distribution of seven ecologically and economically important tree species of Europe in terms of their climatic suitability with an ensemble approach while accounting for uncertainty due to model algorithms. The dataset was documented following the ODMAP protocol to ensure reproducibility. Our maps are input data in a decision support tool “SusSelect” which predicts the vulnerability of forest trees in climate change and recommends adapted planting material. Dataset access is at https://doi.org/10.5281/zenodo.3686918. Associated metadata are available at https://metadata-afs.nancy.inra.fr/geonetwork/srv/fre/catalog.search#/metadata/fe79a36d-6db8-4a87-8a9f-c72a572b87e8.


Background
Climate change is likely to cause widespread shifts in the composition and range of plant communities worldwide (Scheffers et al. 2016). For long-living communities such as forests, such change may lead to a drastic decline in their ability to support multiple ecosystem services (Maroschek et al. 2009;Härtl et al. 2016;Mina et al. 2017). In Europe, the effects of climate change on forests may include changes in forest productivity (Reyer et al. 2014), changes in the distribution of tree species (Dyderski et al. 2018;Thurm et al. 2018), the economic value of forests (Hanewinkel et al. 2013), effects of intensifying disturbance regimes (Seidl et al. 2011(Seidl et al. , 2014, and droughts (Allen et al. 2010).
As such, there has been considerable interest in estimating the potential distribution of tree species under scenarios of climate change. Species distribution models (SDMs), often referred to as ecological niche models (ENMs), are the most widely used tools for this purpose (Sykes et al. 1996;Zimmermann et al. 2010;Guisan et al. 2013;Dyderski et al. 2018), because they predict the potential distribution of species by exploiting the correlation between the known occurrence of a species and corresponding environmental conditions.
In the recent decades, SDMs have evolved and were applied for a wide range of questions such as to predict species range in the future (Sykes et al. 1996;Thuiller et al. 2008;Dyderski et al. 2018), to test hypotheses about species distribution limits (Kreyling et al. 2015), to develop conservation and management strategies in climate change (Guisan et al. 2013;Hamann and Aitken 2013;Mcshea 2014;Schueler et al. 2014), and understand the role of genetic variation in tree species distributions (O'Neill et al. 2008;Benito Garzón et al. 2011;Valladares et al. 2014;Chakraborty et al. 2019;Garate-Escamilla et al. 2019).
Despite the recent improvements and widespread use, the free and unrestricted utilization of SDMs in the applied forest and conservation science is often limited due to inadequate documentation and reporting of the predictions and uncertainties. Therefore, Zurell et al. (2020) proposed a reporting protocol known as ODMAP (Overview, Data, Model, Assessment, and Prediction), which offers a standardized way of communicating the results/outputs from SDMs by describing the objectives, model assumptions, scaling issues, data sources, model workflow, model predictions, and uncertainty.
Here we present a dataset on the potential distribution of seven widely occurring tree species of Europe for current and projected future climate scenarios. To ensure transparent reporting and reproducibility, we described the dataset according to the ODMAP protocol suggested by Zurell et al. (2020). The following sections describe the basic elements of the dataset, while the detailed metadata according to ODMAP (Zurell et al. 2020) is presented in Table 2 in Appendix.

Species occurrence data
Current occurrence (presence and absence) of seven major stand forming tree species in Europe (Table 1) was obtained from the EU-Forest dataset (Mauri et al. 2017). These species are known to form stands in a wide range of forest types across Europe (European Environmental Agency 2006) and are also economically important (Hanewinkel et al. 2013). The Mauri et al. (2017) dataset is one of the most exhaustive, harmonized European tree species occurrence (presence) data available till date, which combines three existing datasets: the Forest Focus (Hiederer et al. 2011), Biosoil (Houston Durrant et al. 2011, and national forest inventories. In our case, the geographic locations of the target species in the EU-Forest dataset were assumed to be true presences, while the presence locations of other target species were assumed to be the absence locations. To ensure that the absence locations are not only climatically dissimilar but also geographically distant from the observed presence locations, we developed the so-called pseudoabsences according to Senay et al. (2013). This is a three-step approach: (i) specifying a geographical extent outside the observed presences, (ii) environmental profiling of the absences outside this geographic extent, and (iii) k-means clustering of the environmental profiles and selecting random samples within each cluster. In our case, a 2-degree buffer was found to be optimum following Senay et al. (2013). The absence locations outside this geographic extent were classified into 10-15 environmentally dissimilar clusters according to the k-means clustering algorithm. The numbers of absence clusters for each species were determined from the elbow of the plot of total within-cluster sum of square (WSS) and number of clusters. The number of pseudoabsence locations was further reduced by randomly selecting a sample of locations defined by the 95% confidence interval from each of the absence clusters. This approach was used Table 1 Occurrence (presence and absence points) for the seven tree species obtained from Mauri et al. (2017) and model evaluation statistics. The model evaluation based on mean ROC, TSS, sensitivity, and specificity of the models used to develop the ensemble predictions. For detailed model evaluation see Table 5

Climate data
Biologically relevant climate variables were obtained from the ECLIPS 2.0 dataset (Chakraborty et al. 2020a, b). This dataset was developed from dynamically downscaled, and bias-corrected regional climate model results from the EURO-CORDEX with a resolution of 30 arcsec. The EURO-CORDEX (www.euroc ordex .net) is an initiative of the World Climate Research Program (Giorgi et al. 2009) for coordinating dynamic regional downscaling of the global climate projections from the CMIP5 (Coupled Model Intercomparison Project Phase 5) (Jacob et al. 2014). All projections were corrected for bias using a distribution scaling method (Yang et al. 2010) to produce 0.11 × 0.11° resolution gridded data for daily mean, minimum, and maximum nearsurface air temperature and precipitation. We further refined this 0.11 × 0.11° resolution bias-corrected data to 30 arcsec using the delta algorithm for spatial downscaling (Ramirez-Villegas and Jarvis 2010; Moreno and Hasenauer 2016). With this approach, we developed a gridded dataset for 80 climate variables ( Table 3 in Appendix) for historic climate  and three future time frames which include averages of (2041-2060, 2061-2080, and 2081-2100)

Variable selection
From the list of potential predictor variables (Table 3 in Appendix), the ones which explain most of the variation in the observed presence and absences of each species were selected with a recursive feature elimination approach (RFE) implemented within the Random forest algorithm (Breiman 2001). Within the RFE approach, the variables were eliminated iteratively, starting from the full set of potential predictors and retaining only those variables that reduce the mean square error over random permutations of the same variable. The variables which were linearly correlated with other variables and had a variance inflation factors VIF > 5, a commonly used threshold in detecting mulicollinearity (Craney and Surles 2002;Thompson et al. 2017), were identified. The identified collinear variables with the lower value according to the Akaike Information Criteria (AIC) (Akaike 1974) were retained for further model development. This subset of uncorrelated climate variables ( Table 4 in Appendix) was used as predictor variables for developing the ensemble species distribution models.

Ensemble species distribution models
To model the potential distribution of the seven European tree species, an ensemble distribution modeling approach, implemented through the R package, biomod2 (Thuiller et al. 2016), was used. biomod2 offers a computational platform for multi-method modeling that generates models of species' potential distribution for each species. Tsuruoka. Hence, biomod2 combines the strengths of multiple modeling algorithms while accounting for their uncertainties. We used biomod2 default settings for all the modeling algorithms (Thuiller et al. 2016). Each model algorithm predicted the probability of the potential distribution for each species. Such probabilities predicted from the individual models were ensembled into a consensus model by combining the median probability over the selected models with true skill statistics threshold (TSS > 0.7) (Allouche et al. 2006;Coetzee et al. 2009). The median was chosen because it is known to be less sensitive to outliers than the mean. The estimated ensemble model predictions were presented as GeoTIFF rasters. These raster files are available at https ://doi. org/10.5281/zenod o.36869 18.

Model evaluation and uncertainty analysis
Model evaluation was carried by splitting the occurrence dataset into 75% for model training and 25% for model testing. Besides, biomod2 allows specifying the number of runs for each combination of training and testing data. Therefore, 10 independent runs, each with a randomly selected set of training and test data, were implemented. For each such model run as well as the final ensemble models, the model evaluation statistics were recorded. These statistics were true skill statistics (TSS) and area under the relative operating characteristic (ROC), model sensitivity (the ability of the model to predict true presences), and model specificity (the ability of the model to predict the true absences). TSS takes into account both omission and commission errors and ranges also from − 1 to + 1, not being affected by prevalence as KAPPA (Allouche et al. 2006). TSS values ranging from 0.2 to 0.5 were considered poor, from 0.6 to 0.8 useful, and values larger than 0.8 were good to excellent (e.g., Coetzee et al. 2009). Prediction accuracy is considered to be similar to random for ROC values lower than 0.5; poor, for values in the range 0.5-0.7; fair in the range 0.7-0.9; and excellent for values greater than 0.9 (Pontius and Parmentier 2014).
Model uncertainty was also estimated in terms of coefficient of variation (CV) among the predictions of the individual models. The estimated CVs are also presented as GeoTIFF rasters where each cell corresponds to a CV value, whereby higher and lower CV values indicate higher and lower uncertainties, respectively, in the ensemble model. These raster files are available at https :// doi.org/10.5281/zenod o.36869 18.
In addition to internal evaluation, the model predictions were also tested against independent data on European Forest Genetic Conservation Units (GCU) (Lefèvre et al. 2013). The geographic locations of the 3354 genetic conservation units ( Fig. 3 in Appendix) were used to extract the predicted probability of occurrence from the models for the seven target species for the period 1961-1990. The ensemble models were used to predict the distribution of the seven target species at each GCU location. Predicted probability < 60 were assumed to be, "incorrectly predicted," whereas those > 60% were treated as "correctly predicted" following Dyderski et al. (2018). For most species, the incorrectly classified GCUs are those located in the southeastern part of their potential distribution ( Fig. 3 in Appendix).

Technical validation
In general, for all species, a high correlation was observed between the predictive performance of the models calibrated with both training and evaluation data with mean TSS ranging from 0.79 to 0.92 and mean ROC ranging from 0.92 to 0.98 (Table 1). Average sensitivity or the ability of the models to predict true presences across all species and models range from 95 to 98% and average specificity or the ability of the models to predict true absences range 86-96% (Table 1). Detailed performance of individual models can be found in Table 5 in Appendix.
Model evaluation against independent data reveals that out of the total 3354, 80-96% of the species occurrence in the European genetic conservation unit (GCU) dataset was correctly predicted by our ensemble SDMs (Table 6 in Appendix).
The ensemble SDMs predicts a substantial change in the potential distribution of the seven target species (Fig. 1). A general trend of a northward shift in potential climate suitability (probability > 60%) was predicted, as also observed by recent studies such as Dyderski et al. (2018). Median uncertainty represented by the coefficient of variation between individual models varies between 6 and 15% and with Larix decidua and Abies alba having higher prediction uncertainty compared to other species (Fig. 2).

Reuse potential and limits
The dataset is currently being used to develop a decision support tool, SusSelect Smartphone app https ://play.googl e. com/store /apps/detai ls?id=com.topol ynx.susse lect&hl=en, which calculates the vulnerability of tree species under climate change. The dataset is also being used to develop an Integrated Toolbox that combines tools from Interreg CE, Horizon 2020, and EU Life projects. This integrated toolbox (TEACHER-CE) is under development and focuses on climate-proof management of water-related issues such as floods, heavy rain, and drought risk prevention, small water retention measures, and protection of water resources through sustainable land-use management. For details see: https :// www.inter reg-centr al.eu/Conte nt.Node/TEACH ER-CE.html. Fig. 1 Potential distribution of seven European tree species under the historical period  and predicted future scenario of 2080-2100 under RCP 4.5 andRCP 8.5 Ecological niche models or SDMs assume that the relation between climatic drivers and the species distribution remains constant also in climate change. This assumption needs to be taken into account while interpreting the results of the paper. Chakraborty D, Móricz N, Rasztovits E, Dobor L, Schueler S (2020

Appendix
Provisioning forest and conservation science with highresolution maps of potential distribution of major European tree species under climate change.   ) and three future time frames which include averages of (2041-2060, 2061-2080, and 2081-2100). The predictions were done for two Representative Concentrations RCP 4.5 and RCP 8.5 Biodiversity data overview Observation type: standardized monitoring Response data type: presence/absence data

Type of predictors Climatic
Conceptual model/hypotheses A large body of scientific studies indicate that climate is one of the major drivers of the distribution of tree species at the continental scale. We exploited this correlation between species' current occurrence and climate to develop SDMs that predict the potential distribution of the target tree species

Assumptions
We assumed that species are at pseudo-equilibrium with the environment. The source of the presence/absence data (Mauri et al. 2017) used in this study is largely from national forest inventories where tree individuals below a certain diameter at breast height are not recorded. We assume that this data collection procedure did not bias our occurrence data Since our occurrence dataset covers the whole current distribution of the target species, which represents both current and likely future climate of Europe, we safely assumed that the species retain their niches across space and time and the current occurrence~climate correlation remains stable when predicting the models for future climate   (Mauri et al. 2017) varied in their sampling intensity and design. This data was harmonized and aggregated to a spatial resolution of 1 square kilometer, in line with an INSPIREcompliant 1-km × 1-km grid Sample size The dataset includes a total of 1,000,525 occurrence records at a spatial resolution of 1 × 1 km (Mauri et al. 2017) Data filtering: Form the EU-Forest dataset we obtained 412,2881 occurrence records about the seven target species Presence-absence data: In our case the geographic locations of the target species in the EU-Forest dataset was assumed to be true presences, while the remaining locations of occurrence of other species were assumed to be the absence locations To ensure that the absence locations are not only climatically dissimilar but also geographically distant from the observed presence locations, we developed the so-called pseudo absences according to Senay et al. (2013). This is a three-step approach: (i) specifying a geographical extent outside the observed presences, (ii) environmental profiling of the absences outside this geographic extent, and (iii) k-means clustering of the environmental profiles and selecting random samples within each cluster. In our case, a 2-degree buffer was found to be optimum following Senay et al. (2013). The absence locations outside this geographic extent were classified into 10-15 (depending on species) environmentally dissimilar clusters according to the k-means clustering algorithm. The number of clusters for each species were determined with a plot of total within-cluster sum of square (WSS) and number of clusters The number of pseudoabsence locations was further reduced by randomly selecting a sample of locations defined by the 95% confidence interval from each of the clusters. This approach was used to generate pseudoabsence for all the seven species

Data partitioning
The occurrence dataset for each target species was partitioned by splitting into 75% for model training and 25% for model evaluation

Environmental predictors Predictor variables
Environmental predictors were 80 biologically relevant climate variables comprising of annual, seasonal, and monthly variables From this list of 80 variables, a small subset of potential predictor variables was selected for each target species during the variable selection process Data sources: The spatial resolution of predictor data: 30 arcsec which is roughly equivalent to 1 × 1 km or lower depending on latitude The temporal resolution of predictor variable: Historic climate ) and three future time frames which include averages of (2041-2060, 2061-2080, and 2081-2100) for two Representative Concentration RCP 4.5 and RCP 8.5 were used for the SDM predictions Geographic projection: WGS 84 (EPSG: 4326)

ODMAP elements Contents
Variable selection and multicollinearity From the list of potential predictor variables (Table 2 in Appendix), the ones which explain most of the variation in the observed presence and absences of each species were selected with a recursive feature elimination approach (RFE) implemented within the Random forest algorithm (Breiman 2001). Within the RFE approach, the variables were eliminated iteratively, starting from the full set of potential predictors (Table 2 in Appendix), and retaining only those variables that reduce the mean square error over random permutations of the same variable. The variables which were linearly correlated with other variables and had a variance inflation factors VIF > 5 as suggested by Booth et al. (1994) were identified, and the ones with the lower value according to the Akaike Information Criteria (AIC) (Akaike 1974) were retained for further model development. This subset of uncorrelated climate variables ( Table 3 in Appendix) was used as predictor variables for developing the ensemble species distribution models

Model settings
The models were run with the default settings of biomod2 (Thuiller et al. 2016) Model estimates The models estimated median ensemble probability of species occurrence and associated model uncertainty represented by the coefficient of variation Model ensemble Predicted probabilities from the individual models for each target species were ensembled as a consensus model which combined the median probability over the selected models with true skill statistics threshold (TSS > 0.7) (Allouche et al. 2006;Coetzee et al. 2009) Threshold selection True skill statistics threshold (TSS > 0.7), a commonly used threshold for SDMS (Allouche et al. 2006;Coetzee et al. 2009), was used Assessment Model performance statistics For each such model run as well as the final ensemble models for each target species, the model evaluation statistics were recorded. These statistics were true skill statistics (TSS) and area under the relative operating characteristic (ROC), model sensitivity (the ability of the model to predict true presences), and model specificity (the ability of the model to predict the true absences). TSS takes into account both omission and commission errors and ranges also from − 1 to + 1, not being affected by prevalence as KAPPA (Allouche et al. 2006). TSS values ranging from 0.2 to 0.5 were considered poor, from 0.6 to 0.8 useful, and values larger than 0.8 were good to excellent (e.g. Coetzee et al. 2009). Prediction accuracy is considered to be similar to random for ROC values lower than 0.5; poor, for values in the range 0.5-0.7; fair in the range 0.7-0.9; and excellent for values greater than 0.9 (Pontius and Parmentier 2014)

Prediction
Prediction output Predicted probabilities from the individual models and target species were ensembled as a consensus model which combined the median probability over the selected models with true skill statistics threshold (TSS > 0.7) (Allouche et al. 2006;Coetzee et al. 2009). The median was chosen because it is known to be less sensitive to outliers than the mean. The estimated ensemble model predictions were presented as GeoTIFF rasters Uncertainty quantification Model uncertainty was estimated in terms of coefficient of variation (CV) among the predictions of the individual models. The estimated CVs are also presented as GeoTIFF rasters where each cell corresponds to a CV value whereby higher and lower CV values indicate higher and lower uncertainty respectively in the ensemble model Average temperature month 01°C Tave02 Average temperature month 02°C Tave03 Average temperature month 03°C Tave04 Average temperature month 04°C Tave05 Average temperature month 05°C Tave06 Average temperature month 06°C Tave07 Average temperature month 07°C Tave08 Average temperature month 08°C Tave09 Average temperature month 09°C Tave10 Average temperature month 10°C Tave11 Average temperature month 11°C Tave12 Average temperature month 12°C TD Temperature difference between MWMT and MCMT(°C)°C Tmax_an Maximum yearly temperature°C Tmax_at Maximum autumn temperature°C Maximum temperature 01°C Tmax02 Maximum temperature 02°C Tmax03 Maximum temperature 03°C Tmax04 Maximum temperature 04°C Tmax05 Maximum temperature 05°C Tmax06 Maximum temperature 06°C Tmax07 Maximum temperature 07°C Tmax08 Maximum temperature 08°C Tmax09 Maximum temperature 09°C Tmax10 Maximum temperature 10°C Tmax11 Maximum temperature 11°C Tmax12 Maximum temperature 12°C Tmin_an Minimum annual temperature°C Tmin_at Minimum autumn temperature°C Tmin_sm Minimum summer temperature°C Tmin_sp Minimum spring temperature°C Tmin_wt Minimum winter temperature°C Tmin01 Minimum temperature 01°C Tmin02 Minimum temperature 02°C Tmin03 Minimum temperature 03°C Tmin04 Minimum temperature 04°C Tmin05 Minimum temperature 05°C Tmin06 Minimum temperature 06°C Tmin07 Minimum temperature 07°C Tmin08 Minimum temperature 08°C Tmin09 Minimum temperature 09°C Tmin10 Minimum temperature 10°C Tmin11 Minimum temperature 11°C Tmin12 Minimum temperature 12°C      Table 6 Predicted probability of occurrence of the seven target species predicted for independent data of European genetic conservation units from Lefèvre et al. (2013). Probability class of 0-40, and 40-60 were assumed to be incorrectly predicted and > 60% as correctly predicted by the SDMs Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.