1 Introduction

Spatial atmospheric data for mountain areas are essential for meteorological prediction and natural hazard identification (Pozdnoukhov et al. 2009). Over long timescales, spatial climate data are necessary for evaluating spatiotemporal climate patterns and trends (e.g., Beniston et al. 2010). From a hydro-climatological perspective, spatial atmospheric data are the input variables of many hydrological models (e.g., Pelliciotti et al. 2008) that are necessary for glacier modelling (e.g., Huss and Fischer 2016) and for estimating future mountain water resources (e.g., López-Moreno et al. 2009), which represent significant water resources for the downstream areas (Viviroli et al. 2007). Furthermore, accurate spatial climate data are required for a wide range of environmental disciplines, for tasks such as determining ecological niches and evaluating forest areas at risk of fire (e.g., Turco et al. 2018).

The accuracy of the spatial interpolation of climate variables in mountainous zones varies with the temporal period defined or the month of the year considered (e.g., Ninyerola et al. 2000; 2005; 2007a; b). The complex mountain topography generates cold pools and non-linear oscillations of air temperature (e.g., Brunetti et al. 2006; Frei 2013). In addition, the different ways in which mountains are exposed to the main air masses cause rain shadow effects (e.g., Bonsoms et al. 2021a). These geographical factors, together with the lack of meteorological stations at high elevations, decrease the accuracy of spatial climate data over short distances (Daly et al. 2008). Hence, a detailed evaluation of the spatial interpolation methods in mountain zones is essential.

Climate variables, namely minimum and maximum annual temperature (Tmin and Tmax, respectively) and precipitation (PP), are usually spatially interpolated using the regression methodology proposed by Agnew and Palutikof (2000) at the beginning of the twenty-first century. In their study, they introduced geographical data (elevation, distance to the sea, latitude and longitude) in a Geographical Information System (GIS) and performed a multiple linear regression (MLR) to successfully produce climate maps of the Mediterranean Basin. The good results obtained promoted the diffusion of the method. Generally, the MLR model is combined with global, local, geostatistical and hybrid methods for mapping purposes (Burrough and McDonnell 1998a, b; Vicente-Serrano et al. 2003; Peña-Angulo et al. 2016). For instance, Ninyerola et al. (2000; 2007a; b) generated the Digital Climate Atlas of the Iberian Peninsula based on the spatial interpolation of the MLR residuals. Similarly, Peña-Angulo et al. (2016) performed an MLR with a Kriging interpolation of Tmin and Tmax for the entire Iberian Peninsula. In addition, the Digital Climate Atlas of Andorra was obtained with the spatial interpolation of the MLR residuals, using Inverse Distance Weighting (IDW) for mapping temperature and Splines for PP (Batalla et al. 2016). The MLR-based method (statistical model and independent variables) has been modified over the years and applied to different predicted variables. For example, in the Pyrenees, López-Moreno and Nogués-Bravo (2005) used Generalized Additive Models (GAM) for modelling the spatial distribution of snow depth. Later, a comparison of the performances of GAM and MLR showed better results with the former (López-Moreno and Nogués-Bravo 2006). The same conclusions were reached with other climate variables.
The GAM and the Generalized Linear Models (GLMs) provided better results than the linear methods for modeling the spatial distribution of temperature and precipitation (López-Moreno et al. 2007), evapotranspiration (Vicente-Serrano et al. 2007) and fog (Vicente-Serrano et al. 2010). Some advances have also been made regarding the independent variables used. For instance, the combination of remote sensing and vegetation indexes was introduced in the MLR equations (e.g., Cristobal et al. 2008; Mira et al. 2017). The best results in terms of accuracy (RMSE) and performance (R2) for temperature and PP are usually found at a daily scale, when the geographical data and the circulation weather types (CTs) are combined. This was observed for the spatial interpolation of temperature and PP, based on an MLR (Esteban et al. 2009; Lemus-Canovas et al. 2018). Subsequently, the comparison of MLR with GAM and GLMs showed that better results were obtained with the last two methods (Lemus-Canovas et al. 2019).

Comparisons between mapping methods, namely IDW and Kriging (Ninyerola et al. 2000) and IDW and Splines (Ninyerola et al. 2007a; Ninyerola et al. 2007b; Esteban et al. 2009; Batalla et al. 2016; Lemus-Canovas et al. 2018), revealed negligible differences. Hence, further efforts to decrease the interpolation error of spatial climate data should focus on improving the regression model. In this sense, the linear-based methods have some limitations. Linear models can underestimate climate data because some climate variables, such as PP, do not show linearity with elevation (e.g., Henn et al. 2019). In recent decades, machine learning (ML) methods have been introduced in the climatological sciences, showing promising results. However, no study to date has compared ML methods with MLR or GAM models for the spatial interpolation of climatological (i.e., > 30 years) data (Tmin, Tmax and PP) for an entire mountain range, or analysed the accuracy by climate sectors of a mountain range. Therefore, in this study we aim to address this knowledge gap by evaluating five tuned ML algorithms and comparing the results with those obtained with MLR and GAM, using the Pyrenees mountain range as a case study. The ML methods included in this work are classified into three groups: (i) ML techniques based on non-linear approaches, such as the K-Nearest-Neighbours (KNN) and the Support Vector Machines with radial basis kernel (SVM); (ii) ML based on Neural-Network (NN) techniques and (iii) ML based on ensemble regression trees, using boosting techniques, such as the Stochastic Gradient Boosting (GMB), or bagging techniques, such as Random Forest (RF). Topographically complex sectors have different climate patterns, and the accuracy of each method can depend on the spatial Pyrenean sector (Vicente-Serrano et al. 2003). Hence, we determined which method performs best for the entire mountain range, for climate clusters (CL) within the range and for different elevation ranges.
This study addresses the following objectives:

  (i) To compare the performance of seven different methods for the spatial interpolation of annual Tmin, Tmax and PP in a mountain range, in this case the Pyrenees, for the 1981–2015 period.

  (ii) To analyse the accuracy and performance of the spatial interpolation methods as a function of CL and elevation range.

The article is structured as follows. In Sect. 2 the study area is presented. Section 3 provides a description of the data and the methods used. The results and discussion are presented in Sect. 4. Finally, Sect. 5 summarises the main conclusions.

2 Study area

The Pyrenees are located in the south-west of Europe, between the north of Spain and the south of France, at latitudes ranging from 42 to 43 ºN and longitudes between 2 ºW and 3 ºE. The highest elevations are found in the central area of the mountain system (Fig. 1), with mountain peaks exceeding 3000 m (e.g., Aneto, 3404 m a.s.l.).

Fig. 1 Map of the study area analyzed in this work. The limits of the Pyrenees are defined according to the CLIMPY project (https://www.opcc-ctp.org/en/climpy). The coordinate reference system is ETRS89 (UTM 30N)

The Pyrenees mountain range encompasses different types of climates (Bonsoms et al. 2021b), supplies most of the annual runoff of the Ebro River Basin (López-Moreno et al. 2009) and shows marked geoecological differences (Oliva et al. 2016). The climate is governed by a prevailing westerly circulation during the cold half of the year and a relative disconnection from the general circulation towards the eastern Iberian Peninsula, due to the topographical effects and the west–east alignment of the mountain zones (e.g., Martín-Vide and López-Bustins 2006). In the western and central zone of the southern slopes, PP from December to March is governed by negative phases of the North Atlantic Oscillation (NAO) and westerly flows (Buisán et al. 2016). The influence of the NAO decreases towards the northern and southern slopes of the eastern sector of the mountain range (Alonso-González et al. 2020a, b). In the eastern area, negative phases of the Western Mediterranean Oscillation (WeMO), related to south-east and east Mediterranean advections, govern the PP timing (Martín-Vide and López-Bustins 2006; López-Bustins and Lemus-Cánovas 2018). The different low-frequency climate patterns that govern the study zone are in turn translated into different PP regimes and timings. Between December and March, PP exceeds ca. 600 mm on the southern slopes of the western area, whereas it is < 400 mm in the central area and < 200 mm in the eastern strip (Buisán et al. 2016). At a local scale, small-scale variations of PP and temperature are observed. Likewise, in the valleys, basin-scale inversions are frequent during winter under stable weather conditions (e.g., Pepin and Kidd 2006). These regional climatological contrasts suggest that each spatial interpolation method needs to be analyzed as a function of elevation and CL.

3 Data and methods

3.1 Data

3.1.1 Observed dataset

To avoid large uncertainties related to climate data quality and homogenization, in this work we used the CLIMPY (Characterization of the evolution of climate and provision of information for adaptation in the Pyrenees) gridded dataset (available at https://zenodo.org/record/3611127#.YLsz-fkzY2x). CLIMPY is a transnational climate project that analyzed the evolution of the Pyrenean climate during the last half century, providing a grid (1 km × 1 km; covering more than 50,000 km², Fig. 1) of PP, Tmin and Tmax for the whole mountain range. The dataset was created following a reconstruction, gap filling and quality control process detailed in Serrano-Notivoli et al. (2017). The temporal period analyzed encompasses all days from 1981 to 2015. The meteorological records are managed by the Spanish State Meteorological Agency (AEMET), Météo-France (MF) and the Meteorological Service of Catalonia (SMC).

3.1.2 Predictor variables

The independent variables used in previous works in this area are generally based on geographical data (e.g., Ninyerola et al. 2000; Vicente-Serrano et al. 2003; Esteban et al. 2009; Batalla et al. 2009; Lemus-Canovas et al. 2018 and 2019). Other variables, such as the topographic wetness index or the potential solar radiation could be included. However, no significant contribution for the spatial interpolation of climate variables has been observed (see, for example, Mira et al. 2017), and therefore they have been excluded from this work.

The initial variables used in this article were the normalized values of elevation, distance to the coast (continentality), latitude and longitude (e.g., Hofstra et al. 2008). All of the independent variables are based on a Digital Elevation Model (DEM) with a spatial resolution of 90 m, downloaded from the Shuttle Radar Topography Mission (SRTM). Continentality was generated using the Euclidean distance to the coastline following Ninyerola et al. (2000). A Pearson correlation test (r) was performed between the predictor variables to avoid multicollinearity. As a result, continentality was excluded due to the high and statistically significant correlation with longitude (r = -0.9, p < 0.05 and r = 0.9, p < 0.05, respectively). The other independent variables do not show a relevant correlation (r < 0.4, p < 0.05). Therefore, the predictor variables are elevation, latitude and longitude.
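The multicollinearity screening described above can be illustrated with a minimal sketch. The predictor values are toy numbers, the helper names (`pearson_r`, `collinear_pairs`) are hypothetical, and the 0.7 cut-off is an assumption made only for this example; none of them comes from the CLIMPY grid or from this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def collinear_pairs(predictors, threshold=0.7):
    """Return (name_a, name_b, r) for every predictor pair with |r| >= threshold.

    The threshold is an illustrative assumption, not a value from the study.
    """
    names = list(predictors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(predictors[a], predictors[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged

# Toy values (hypothetical): continentality shrinks as longitude grows.
predictors = {
    "elevation": [1200.0, 300.0, 2100.0, 800.0, 1500.0],
    "longitude": [-1.5, -0.5, 0.5, 1.5, 2.5],
    "continentality": [150.0, 110.0, 90.0, 40.0, 10.0],
}
flagged = collinear_pairs(predictors)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```

In this toy case only the longitude and continentality pair is flagged, so one of the two would be dropped before fitting any model.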

3.2 Methods

The PP, Tmin and Tmax values (daily resolution) were averaged for each specific grid site and year. Subsequently, the data were randomly split into two groups: the training and the testing dataset. Cristobal et al. (2008) used 60% of the dataset for training the methods and the remainder for testing. On the other hand, Feng et al. (2019) split the data into 80% for training the ML methods and the other 20% for testing. Meyer et al. (2016a, b) compared four ML methods using 33% of the data for training the algorithms and the remainder for validation purposes. We divided the data into the same proportions as this last study (33% for training and the remainder for testing), since we had enough climatological data to do so (Table 2).
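As an illustration of such a random partition, the following sketch splits a list of grid-cell identifiers into 33% for training and the remainder for testing. The function name, the seed and the 1000 dummy cells are assumptions for the example only.

```python
import random

def train_test_split(records, train_fraction=0.33, seed=42):
    """Randomly partition records into a training and a testing list."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(len(records)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(records))
    train = [records[i] for i in indices[:n_train]]
    test = [records[i] for i in indices[n_train:]]
    return train, test

# Dummy grid-cell identifiers (hypothetical).
grid_cells = list(range(1000))
train, test = train_test_split(grid_cells)
print(len(train), len(test))
```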

3.2.1 Statistical and Machine Learning methods

This section presents an overview of each method. A detailed methodological description can be consulted in the supplementary material.

The MLR assumes a linear relationship between the independent variables and the dependent variable. GAM is a non-parametric extension of the generalized linear models with smoothing functions to fit non-linear responses of the independent variables (Hastie and Tibshirani 1987). We included five ML algorithms: (i) the KNN (Bishop 1995), based on the Minkowski distance of the training data points and assuming a rectangular kernel; (ii) the NN, an ML method based on a distributed system of neurons that creates a wide range of non-linear functions with more than one interconnected layer (Haykin 1998). We also included (iii) the SVM, based on a radial basis function kernel (Vapnik 1998). Regarding the regression trees, we included (iv) the GMB (Friedman 2001), a boosting method that generates an additive model that reduces the loss function. Finally, (v) we added the RF, an ML algorithm based on decision trees and bootstrap aggregation (Breiman 2001).

Previous literature showed that tuning the hyperparameters of the ML algorithms provides better results than the default ML parameters (e.g., Tripathi et al. 2006). Accordingly, we parameterized the GAM and tuned the ML algorithms (Table 1) by means of a tenfold cross-validation with a grid search, using the caret package (Kuhn 2020) of R (R Core Team 2018). Subsequently, the methods were applied to an independent test dataset to check the accuracy and performance metrics. The detailed error metrics of the different hyperparameters tested in the training dataset can be consulted in the supplementary materials. No significant differences were found between the training and testing datasets, which indicates that the ML methods did not overfit.
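The tuning procedure can be sketched in a few lines. The example below is not the caret workflow used in this work; it is a simplified, pure-Python illustration of a tenfold cross-validated grid search over the number of neighbours of a KNN regressor (Euclidean distance, i.e. Minkowski with p = 2, and equal weights, i.e. a rectangular kernel). The synthetic data, the candidate grid and all function names are assumptions.

```python
import math
import random

def knn_predict(train_x, train_y, query, k):
    """Mean target of the k nearest training points (equal weights)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k

def rmse(observed, predicted):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def cross_validated_rmse(x, y, k, n_folds=10, seed=0):
    """Average RMSE over n_folds held-out folds for a given k."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        fit_idx = [i for i in indices if i not in held]
        fit_x = [x[i] for i in fit_idx]
        fit_y = [y[i] for i in fit_idx]
        preds = [knn_predict(fit_x, fit_y, x[i], k) for i in held_out]
        scores.append(rmse([y[i] for i in held_out], preds))
    return sum(scores) / n_folds

def grid_search(x, y, candidate_ks):
    """Evaluate every candidate k and return the one with the lowest CV RMSE."""
    results = {k: cross_validated_rmse(x, y, k) for k in candidate_ks}
    best_k = min(results, key=results.get)
    return best_k, results

# Synthetic predictors: (elevation in km, scaled latitude); synthetic temperature.
rng = random.Random(1)
x = [(rng.uniform(0, 3), rng.uniform(0, 1)) for _ in range(200)]
y = [15.0 - 5.0 * e + 2.0 * math.sin(6.0 * lat) + rng.gauss(0, 0.3) for e, lat in x]

best_k, cv_results = grid_search(x, y, candidate_ks=[1, 3, 5, 9, 15])
print("selected k:", best_k)
```

The selected hyperparameter would then be used to refit the model on the whole training set before it is evaluated once on the independent test set.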

Table 1 Description of the methods used in this work and ML hyperparameters selected

The distribution of the data by elevation range is presented in Table 2 and discussed in the results and discussion section.

Table 2 Distribution of the training and testing datasets by elevation range

3.2.2 Evaluation of the methods

The performance of the methods was evaluated using five types of accuracy metrics: the Residuals (Eq. 1), Mean Absolute Error (MAE; Eq. 2), Root Mean Square Error (RMSE; Eq. 3), agreement index (Willmott’s D; Eq. 4) and coefficient of determination (R2; Eq. 5). The MAE and RMSE summarize the mean differences between the predicted and observed values, showing low values when the accuracy is high. On the contrary, high R2 and Willmott’s D are related to high levels of performance, and the latter is less sensitive to outliers than the other accuracy parameters (Willmott 1982).

$$\mathrm{Residual}=Oi- Ei$$
(1)
$$\mathrm{MAE}= \frac{1}{\mathrm{N}}\sum\nolimits_{\mathrm{i}=1}^{\mathrm{N}}|Ei-Oi|$$
(2)
$$\mathrm{RMSE}=\sqrt{\frac{1}{\mathrm{N}}\sum\nolimits_{\mathrm{i}=1}^{\mathrm{N}}{\left(Ei-Oi\right)}^{2}}$$
(3)
$$\text{Willmott's }D=1-\frac{\sum_{\mathrm{i}=1}^{\mathrm{N}}{\left(Ei-Oi\right)}^{2}}{\sum_{\mathrm{i}=1}^{\mathrm{N}}{\left(\left|E{\prime}i\right|+\left|O{\prime}i\right|\right)}^{2}}$$
(4)
$${\mathrm{R}}^{2}={\left[\frac{\sum_{\mathrm{i}=1}^{\mathrm{N}}\left(Oi-\overline{O }\right)\left(Ei-\overline{E }\right)}{\sqrt{\sum_{\mathrm{i}=1}^{\mathrm{N}}{\left(Oi-\overline{O }\right)}^{2}}\sqrt{\sum_{\mathrm{i}=1}^{\mathrm{N}}{\left(Ei-\overline{E }\right)}^{2}}}\right]}^{2}$$
(5)

where N is the number of samples, Ei is the predicted value and Oi is the observed value. \(\overline{O }\) is the average of the observed values and \(\overline{E }\) is the mean of the predicted values. \(E{\prime}i\) is the difference between Ei and \(\overline{O }\), whilst \(O{\prime}i\) is the difference between Oi and \(\overline{O }\).
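For readers who wish to reproduce the evaluation, the five metrics can be written compactly as follows. This is a sketch with hypothetical function names and toy values; Willmott's D is coded in its usual index-of-agreement form (sum of absolute deviations in the denominator, no square root), which is an assumption about the intended definition, and R2 is computed as the squared Pearson correlation.

```python
import math

def residuals(observed, predicted):
    """Observed minus predicted, one value per sample."""
    return [o - e for o, e in zip(observed, predicted)]

def mae(observed, predicted):
    return sum(abs(e - o) for o, e in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(observed, predicted)) / len(observed))

def willmott_d(observed, predicted):
    """Index of agreement; 1 indicates a perfect match."""
    mean_o = sum(observed) / len(observed)
    num = sum((e - o) ** 2 for o, e in zip(observed, predicted))
    den = sum((abs(e - mean_o) + abs(o - mean_o)) ** 2 for o, e in zip(observed, predicted))
    return 1.0 - num / den

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    mean_o = sum(observed) / len(observed)
    mean_e = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, predicted))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_e = sum((e - mean_e) ** 2 for e in predicted)
    return cov ** 2 / (ss_o * ss_e)

# Toy values (hypothetical), e.g. annual Tmin in ºC at four grid cells.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.5]
print(mae(observed, predicted), rmse(observed, predicted))
print(willmott_d(observed, predicted), r_squared(observed, predicted))
```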

3.2.3 K-means unsupervised classification

The spatial regionalization of the Pyrenees was performed using the objective K-means method. K-means is a non-hierarchical cluster analysis method introduced by Forgy (1965) and widely used for determining hydrological and climate zones (e.g., Carvalho et al. 2016; Alonso-González et al. 2020a, b, among other works). The CLs are defined by applying the algorithm to the whole spatial climate dataset. K-means is based on three steps (e.g., Hartigan and Wong 1979): (i) the K centroids (\({t}_{1}\), …, \({t}_{K}\)) are randomly initialized inside the feature space. Subsequently, (ii) the distance of each data object to each centroid, \(d({x}_{i}, {t}_{k})\) with \(k\in \{1,\dots ,K\}\) and \(i\in \{1,\dots ,N\}\), is calculated, and each point is assigned to the nearest centre: \(C\left(i\right)=\underset{k}{\mathrm{arg\,min}}\; d({x}_{i}, {t}_{k})\). Finally, (iii) each centroid is updated to the arithmetic mean of the points assigned to its cluster: \({t}_{k}=\frac{{\sum }_{i=1}^{N}{x}_{i}\,\mathbf{1}\{C\left(i\right)=k\}}{{\sum }_{i=1}^{N}\mathbf{1}\{C\left(i\right)=k\}}\). The algorithm was repeated 100 times after convergence. The spatial clustering allowed us to define different CLs ranging from k = 2 to k = 8. The optimal number of CLs was selected using the between-cluster sum of squares (BETWEENSS), which measures the squared distances between the CL centers, and the total sum of squares (TOTSS). BETWEENSS was divided by TOTSS and multiplied by 100, giving the explained variance (EV). The number of CLs was selected according to the slope change of the scree test (Cattell 1966). For Tmin and Tmax we retained four CLs, which explain 88% of the variance. For PP, we retained five CLs, which explain 93% of the variance (Table 3).
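The three K-means steps and the EV criterion can be sketched as follows. This is a plain Lloyd-type iteration written for illustration, not the routine used to produce Table 3; the synthetic two-dimensional points, the number of restarts and the function names are assumptions.

```python
import random

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean_point(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, n_iter=100, seed=0):
    """Steps (i)-(iii): random initialization, assignment, centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                      # step (i)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]                         # step (ii)
        new_centroids = []
        for j in range(k):                                 # step (iii)
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(mean_point(members) if members else centroids[j])
        if new_centroids == centroids:                     # converged
            break
        centroids = new_centroids
    return centroids, labels

def explained_variance(points, centroids, labels):
    """EV = 100 * between-cluster SS / total SS (between = total - within)."""
    grand_mean = mean_point(points)
    totss = sum(squared_distance(p, grand_mean) for p in points)
    withinss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return 100.0 * (totss - withinss) / totss

def best_of_restarts(points, k, n_starts=10):
    """Run K-means from several random starts and keep the highest EV."""
    runs = [kmeans(points, k, seed=s) for s in range(n_starts)]
    return max(runs, key=lambda run: explained_variance(points, *run))

# Synthetic data: three blobs standing in for three climate sectors.
rng = random.Random(3)
points = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
          for cx, cy in [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)] for _ in range(30)]

ev_by_k = {}
for k in range(2, 6):
    centroids, labels = best_of_restarts(points, k)
    ev_by_k[k] = explained_variance(points, centroids, labels)
    print(k, round(ev_by_k[k], 1))
```

Plotting EV against k and looking for the change of slope corresponds to the scree-test criterion described above.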

Table 3 EV of each number of CL. The selected EVs are expressed in bold

4 Results and discussion

First, we present a detailed characterization of the predicted and observed annual values of Tmin, Tmax and PP in the Pyrenees. We provide a fair comparison between the methods included in this work, given that all of them have been analysed with the same initial conditions. To this end, we tested the accuracy of the methods using an independent and randomly selected dataset, which comprises 66% of the total grid points. We also analyzed the spatial distribution of the accuracy and performance metrics by CL and by elevation range (steps of 500 m), providing a robust evaluation of the interpolation methods.

4.1 Tmin, Tmax and PP analyses on the independent dataset

The basic statistics of Tmin, Tmax and PP are summarized in Table 4. The average observed Tmin is 4.70 ºC. The Standard Deviation (SD) of Tmin is 2.21 ºC, with an annual amplitude higher than 10 ºC. The maximum observed Tmin reaches 9.79 ºC, whereas the minimum Tmin is -1.86 ºC. The MLR and the NN underestimated the minimum Tmin (-3.02 ºC and -2.02 ºC, respectively). The MLR (GAM) underestimated (overestimated) the maximum Tmin, with values of 8.18 ºC (10.07 ºC). The remaining methods reproduced the observed values. RF shows the lowest bias, both for minimum (-1.65 ºC) and maximum Tmin (9.74 ºC).

Table 4 Descriptive statistics of the observed (OBS) and predicted annual Tmin, Tmax and PP values

The average observed and estimated Tmax values are the same: 12.63 ºC. The observed SD is 3.07 ºC, with a minimum Tmax of 2.62 ºC and a maximum of 17.83 ºC. The MLR estimated the lowest minimum Tmax value (2.03 ºC). The contrary is observed with the GAM (3.58 ºC). The ML methods overestimated the minimum Tmax (3.50 ºC, on average). However, all the ML methods perform well in the estimation of the maximum Tmax (17.83 ºC vs 17.30 ºC, observed and estimated values, respectively).

The average observed and estimated annual PP for the entire Pyrenees is 1060 mm. The SD of all the interpolation methods is approximately 430 mm, except for the MLR and GAM (330 mm and 388 mm, respectively). The maximum observed PP (3444 mm) is almost double the maximum value estimated by the MLR (1831 mm).

4.2 Evaluation of the accuracy and performance of the methods

Overall, the methods show good accuracy and performance, since they were trained with a large sample size (Table 2). The ML methods outperformed MLR and GAM in the spatial prediction of Tmin, Tmax and PP. For Tmin, the RMSE ranged from 0.28 ºC (KNN, NN, GMB and RF) to 0.62 ºC (MLR). The KNN, NN, GMB and RF also show the lowest MAE (0.19 ºC). The same is observed with D: except for GAM and MLR, all the algorithms reach the optimal value (1). The lowest R2 values are obtained with MLR (0.92) and GAM (0.96), whereas all the ML methods show the same performance (R2 = 0.98). The R2 is slightly lower for Tmin than for Tmax (Table 5). For Tmax, the GMB scores the lowest error (RMSE = 0.64 ºC, MAE = 0.43 ºC). The least accurate spatial interpolation methods are MLR (RMSE = 0.92 ºC, MAE = 0.70 ºC) and GAM (RMSE = 0.79 ºC, MAE = 0.57 ºC), followed by SVM (RMSE = 0.71 ºC, MAE = 0.48 ºC).

Table 5 Accuracy and performance values, grouped by variable and method

For PP, the RMSE of MLR (315.21 mm) and GAM (239.04 mm) is almost double that measured with the ML algorithms (ca. 120 mm; Fig. 2). Three ML methods show the best spatial prediction of PP: the RF (RMSE = 118.89 mm; MAE = 70.44 mm) followed by GMB (RMSE = 128.83 mm; MAE = 82.98 mm) and KNN (RMSE = 136.77 mm; MAE = 80.26 mm). For temperature, the ML methods explained only ca. 6% more of the variance than MLR. For PP, there are large differences between the linear methods and ML (Fig. 3). MLR and GAM are not able to predict the highest values of PP, reaching only 52% and 72%, respectively, of the explained variance. However, the ML methods are able to predict ca. 40% (20%) more of the variance than MLR (GAM). NN and SVM show good results (R2 = 0.85 and R2 = 0.86, respectively), but worse than those obtained by KNN and GMB (R2 = 0.91 and 0.92, respectively). RF has the best performance (R2 = 0.93).

Fig. 2 a Average RMSE detected by each spatial interpolation method. b Empirical Cumulative Distribution Function (ECDF) of the residual values

Fig. 3 Density scatterplot of the observed (horizontal axis) and predicted (vertical axis) values grouped by method

Our results are in accordance with Duhan and Pandey (2015), who showed that ML methods (i.e., the SVM) estimate monthly Tmin and Tmax better than MLR-based methods for climate downscaling. ML techniques, such as RF in combination with IDW (Fig. 4), have been successfully applied to environmental modelling (e.g., Li et al. 2013). Regression trees were found to be one of the most precise ML types for the spatial interpolation of temperature. This was observed for the interpolation of ground-based temperature data (Appelhans et al. 2015) as well as for remote sensing products (e.g., Noi et al. 2017; Meyer et al. 2016a, b). The RF model could be improved for the spatial prediction of temperature and PP by incorporating the nearest geographical data as a predictor variable (Hengl et al. 2018), obtaining better results than geostatistical methods alone, such as IDW or Kriging (Sekulić et al. 2020). Training the ML algorithms requires more computation time than the linear approaches (nearly 10 h for all the variables analyzed, Table 1). However, ML techniques such as RF do not rely on linear effects or data distribution (Rodriguez-Galiano et al. 2012) and are robust to outliers (Breiman 2001). Moreover, RF parametrization depends on only two parameters (number of trees and mtry). The reconstruction and the gridding of climate data are based on the same methods (Serrano-Notivoli and Tejedor 2021). For reconstruction, KNN is one of the most widely used methods (e.g., Begueria et al. 2016). In this article KNN provided good results, but the algorithm depends on the density of the data. Further works should test climate reconstruction based on the RF model presented in this work, and include data from the nearest stations (Vicente-Serrano et al. 2010) and the average monthly values, together with other proxies such as low-frequency climate modes.

Fig. 4 Comparison of the average yearly values for Tmin, Tmax and PP with MLR (left column) and RF (right column). Mapping was performed with IDW (power = 2). The coordinate reference system is ETRS89 (UTM 30N)

4.3 Distribution of the errors by CL and elevation range

Figure 5 shows the Tmin, Tmax and PP lapse rates (per 100 m) and the accuracy metrics grouped by elevation range. For PP, the gradient is 41.59 mm per 100 m. The average Tmin (Tmax) lapse rate is 0.31 ºC (0.40 ºC) per 100 m. MLR and GAM overestimated the lapse rate, whereas the ML methods slightly underestimated it.
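One simple way to obtain a gradient per 100 m is the slope of an ordinary least-squares line of the variable against elevation. The sketch below assumes that approach (the lapse rates reported here may have been derived differently) and uses toy values; note that it returns a signed slope, whereas lapse rates are usually quoted as magnitudes.

```python
def gradient_per_100m(elevation_m, values):
    """Least-squares slope of values against elevation, scaled to 100 m."""
    n = len(elevation_m)
    mean_z = sum(elevation_m) / n
    mean_v = sum(values) / n
    cov = sum((z - mean_z) * (v - mean_v) for z, v in zip(elevation_m, values))
    var_z = sum((z - mean_z) ** 2 for z in elevation_m)
    return 100.0 * cov / var_z

# Toy values (hypothetical): temperature dropping 2 ºC every 500 m.
elevation_m = [500.0, 1000.0, 1500.0, 2000.0]
tmax = [14.0, 12.0, 10.0, 8.0]
tmax_gradient = gradient_per_100m(elevation_m, tmax)
print(round(tmax_gradient, 2))
```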

Fig. 5 Averaged values of each variable grouped by elevation and method

PP does not follow a Gaussian distribution and the interpolation accuracy should be evaluated by splitting the data using percentiles or elevation ranges (Serrano-Notivoli and Tejedor 2021). Our results show that the differences between the ML and linear approaches increased with elevation for all the analyzed variables (Figs. 5 and 6, Table S1). For Tmin and Tmax, the largest differences between the observed and MLR estimated values are found at > 2000 m, whereas for PP the largest bias is found at > 1500 m. The ML methods show small differences in the accuracy metrics. The most accurate method for estimating Tmin is RF (MAE = 0.07 ºC), followed by GMB (MAE = 0.10 ºC). RF is the most accurate method for Tmax (MAE = 0.40 ºC). GAM shows a good performance for PP (MAE = 145.42 mm), similar to ML methods such as SVM (MAE = 151.84 mm). Nevertheless, the tree-based methods outperformed the others in the spatial estimation of PP. The best results were obtained by RF (avg. MAE = 77.61 mm) followed by GMB (avg. MAE = 107.48 mm).

Fig. 6 Spatial representation of the best performance methods grouped by CL. The coordinate reference system is ETRS89 (UTM 30N)

The decrease in performance in high-elevation areas could be because more than 80% of the data are found at < 1500 m (Table 2). The results agree with those of Herrera et al. (2019), who showed that the density of the meteorological stations explains a large share (60%) of the variance of the climate interpolation accuracy. The linear approaches underestimate the climate values of the highlands, since they are based on extrapolations of climate data found at low elevations. Henn et al. (2018) compared the accuracy of six gridded PP datasets of the western United States, all of them based on weighted linear approaches. They found significant bias in high-elevation areas, which was mainly due to the lack of data in elevated sectors, inhomogeneities and missing data, together with the PP underestimation caused by undercatch. In the Pyrenees, the majority of the meteorological stations are placed in flat and valley zones (Batalla et al. 2016), and in high-elevation sectors of the eastern Pre-Pyrenees, instrumental meteorological records are only available since the early 2000s (Bonsoms et al. 2021b). Therefore, the low accuracy could be explained both by the interpolation methods and by the data uncertainty in elevated areas.

The spatial interpolation performance could differ depending on the spatial scale and the geographical sector of the range (Vicente-Serrano et al. 2003). Therefore, we have provided a spatial evaluation of the accuracy of Tmin, Tmax and PP for Pyrenean areas, based on the K-means unsupervised classification method (Fig. 7). For Tmin, GAM shows good results (RMSE = 0.40 ºC), but all the ML methods are more accurate than GAM (RMSE < 0.30 ºC), with negligible differences between them (Table 6). The same is observed with Tmax. GMB shows the lowest RMSE (0.64 ºC) by CL, but only minor differences are found in comparison with the other ML algorithms. For PP, the regression trees show the best results for all the CLs. The RF algorithm obtained the lowest error (average RMSE = 117.16 mm). The minimum error is found in the driest zone (CL 1, RMSE = 53.97 mm) and the maximum is found in the wet Atlantic area (CL 5, RMSE = 208.07 mm). SVM and NN show a remarkable lack of accuracy in comparison with the regression trees at CL 5 (Fig. 8c). These methods are not able to estimate the extreme values of PP in the wet Atlantic area (RMSE = 304.58 mm and 308.57 mm, respectively). The results provide evidence that the RF algorithm performs better than the other ML methods for the spatial prediction of the extreme low (CL 1) and high (CL 5) values. The differences in error between the ML methods at CL 5 could be explained by the structure of each algorithm. RF is a non-parametric algorithm that builds a wide range of models, separates the nodes of the trees (mtry) and selects the best features for each point; nevertheless, this parameter decision is not equivalent in SVM (Pelletier et al. 2016). Moreover, the absence of differences between the training and testing datasets suggests that RF does not overfit.

Fig. 7 MAE values of each variable grouped by elevation and method

Table 6 RMSE of each method grouped by CL

Fig. 8 Distribution of the average residuals (the difference between the observed and the estimated values; horizontal axis) and RMSE (vertical axis). The points are grouped by CL (shape) and method (colors)

The residuals are the difference between the observed and predicted values. The spatial representation of the residuals shows how each CL departs from the overall model (Fig. 9). The map of the residuals also shows the geographical settings where the representativity of the meteorological station network needs to be increased (Herrera et al. 2019). For Tmin and Tmax, the K-means clustering followed an elevation pattern. The largest errors are found in the highest elevation area (CL 1). The climate maps based on MLR and GAM produce large errors in the estimation of PP (Fig. 9). A PP underestimation can be found on the leeward side. However, the results show an overestimation of PP values in the driest, rain-shadowed zone of the southern slopes of the Pyrenees (CL 1).

Fig. 9 a Tmin, b Tmax and c PP residual maps. The spatial interpolation of the bias was performed with IDW (power = 2). The coordinate reference system is ETRS89 (UTM 30N)
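The IDW mapping step (power = 2) used for the maps can be illustrated with a minimal sketch. The sample locations and residual values are hypothetical, the function name is an assumption, and no search radius or neighbour limit is applied, which a GIS implementation would typically offer.

```python
import math

def idw(known_points, known_values, query, power=2.0):
    """Inverse Distance Weighting estimate at a query location."""
    weighted_sum = 0.0
    weight_total = 0.0
    for point, value in zip(known_points, known_values):
        distance = math.dist(point, query)
        if distance == 0.0:
            return value            # query coincides with a sample point
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Two sample locations with residuals of 1.0 and 3.0 (hypothetical units).
sample_points = [(0.0, 0.0), (2.0, 0.0)]
sample_residuals = [1.0, 3.0]
print(idw(sample_points, sample_residuals, (1.0, 0.0)))   # midway between both samples
print(idw(sample_points, sample_residuals, (0.5, 0.0)))   # closer to the first sample
```

With power = 2 the weight of a sample falls with the square of its distance, so the second query is pulled strongly towards the nearer residual.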

4.4 Variable importance

The most important predictor variable for estimating PP, Tmin and Tmax is determined using the Percentage of Mean Decrease Accuracy (IncMSE) of RF (Fig. 10). The IncMSE measures the decrease in the model performance when one variable is excluded. The Tmax shows a higher dependence on elevation (IncMSE 76%) than Tmin (IncMSE 66%). The temperature could be estimated with the elevation; however, the incorporation of other geographical variables (e.g., latitude; Vicente-Serrano et al. 2003) improves the model’s performance. For PP, the largest dependence is observed with latitude (IncMSE 42%), followed by elevation (IncMSE 32%) and longitude (IncMSE 25%).
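The idea behind this kind of importance score can be sketched with a permutation test: the values of one predictor are shuffled in the evaluation data and the resulting percentage increase in mean squared error is recorded. The example below is only an illustration under several assumptions: it uses a KNN regressor as a stand-in for RF, a held-out set instead of out-of-bag samples, synthetic data and hypothetical function names.

```python
import math
import random

def knn_predict(train_x, train_y, query, k=5):
    """Stand-in model: mean target of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k

def mse(model_x, model_y, test_x, test_y):
    errors = [(knn_predict(model_x, model_y, q) - o) ** 2 for q, o in zip(test_x, test_y)]
    return sum(errors) / len(errors)

def permutation_importance(model_x, model_y, test_x, test_y, names, seed=0):
    """Percentage increase in MSE after shuffling each predictor in turn."""
    rng = random.Random(seed)
    baseline = mse(model_x, model_y, test_x, test_y)
    importance = {}
    for col, name in enumerate(names):
        shuffled = [row[col] for row in test_x]
        rng.shuffle(shuffled)
        permuted = [row[:col] + (value,) + row[col + 1:]
                    for row, value in zip(test_x, shuffled)]
        permuted_mse = mse(model_x, model_y, permuted, test_y)
        importance[name] = 100.0 * (permuted_mse - baseline) / baseline
    return importance

# Synthetic data: temperature depends on elevation (km) but not on the second column.
rng = random.Random(7)
rows = [(rng.uniform(0, 3), rng.uniform(0, 1)) for _ in range(200)]
temps = [15.0 - 5.0 * elev + rng.gauss(0, 0.2) for elev, _ in rows]

importance = permutation_importance(rows[:150], temps[:150], rows[150:], temps[150:],
                                    names=["elevation", "unrelated"])
print(importance)
```

A predictor that the model relies on produces a large increase in error when shuffled, whereas an irrelevant one leaves the error almost unchanged.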

Fig. 10 Percentage of Mean Decrease Accuracy of each predictor variable

There is a general decrease in PP towards the eastern strip (Fig. 4). This is because the western Pyrenees cast a rain shadow over the eastern area (Pepin and Kidd 2006). Latitude, rather than elevation or longitude, is the geographical factor that governs PP. Latitude indirectly expresses the north or south face of the mountain range and the exposure to the CTs. On the southern slopes of the Pyrenees, PP is governed by west and south-west CTs during the cold half of the year; in addition, the influence of north and north-west CTs is reduced moving from north to south of the mountain range (Esteban et al. 2009; Lemus-Canovas et al. 2018). In summer the precipitation is convective (Xercavins 1985), reducing the predictability of PP based on the independent variables included in this work. Similar patterns are observed with temperature. For the spatial interpolation of Tmin and Tmax, Ninyerola et al. (2000) found the lowest accuracy during the winter months. Similar results are observed in Cristobal et al. (2008). In the Alps, the maximum error for the interpolation of temperature was also observed in winter (MAE = 1.5 ºC; Frei 2013). In contrast, the highest accuracy is generally observed during the summer months (Ninyerola et al. 2000; 2007a). This is because the CTs that prevail in the study area during the coldest months (i.e., January) are anticyclonic (Bonsoms et al. 2021a). In these synoptic situations, basin-scale inversions usually occur (Pepin and Kidd 2006) and temperature does not follow a linear lapse rate with elevation. The climatological period analyzed in this article provides a robust mean for every grid site, averaging out local effects. However, this implies a loss of detail at hourly scales, when many different weather situations between the valleys and the summit zones are observed (e.g., nocturnal inversions in the boundary layer during stable weather conditions).
In this sense, further works should address whether the non-linear structure of the ML algorithms can provide optimal results at temporal scales shorter than a day.

5 Conclusions

In a changing climate, evidence is accumulating that it is essential to assess the accuracy of the spatial reconstruction, interpolation and projection of climate variables. In this article we provide insights into the spatial interpolation of climatological variables in a mountain range, comparing the traditional linear approaches (MLR and GAM) with ML methods. Our results show that the ML methods outperform the linear approaches. At the same time, no significant differences between the ML methods are observed for the interpolation of Tmin and Tmax. Our results highlight that ML regression trees better estimate the non-linearities between the geographical factors (latitude, longitude or elevation) and the climate variables. When the evaluation metrics are applied to the entire Pyrenees, four ML methods (KNN, NN, GMB and RF) show the same accuracy for the spatial interpolation of Tmin (RMSE = 0.28 ºC). For Tmax, the most accurate method is GMB (MAE = 0.43 ºC), and for PP the most accurate method is RF (MAE = 70.44 mm). When the analysis is carried out for each Pyrenean cluster and by elevation range, RF is the most accurate method for the spatial interpolation of all the variables. The largest difference between the linear approaches and ML techniques is found with PP. The MLR approach underestimated all the variables and only explained 52% of the PP variance. ML methods based on regression trees were able to reproduce almost all of the PP variance (i.e., 93% for RF).

The evidence given in this article is not restricted to the spatial interpolation of climate data. The results suggest that ML methods would also be more accurate for climatological spatial reconstruction and prediction, and for better understanding the importance of each covariable included in the spatial models. The different regression tree ML methods show similar levels of accuracy. Therefore, further works should test the spatial interpolation of climate data by including other independent variables and improving the temporal and spatial resolution.