1 Introduction

The ratio between the heavy and light stable isotopes in the water molecule (18O/16O; 2H/1H) is an effective tool in resolving practical problems in environmental isotope geochemistry that arise in the disciplines of hydrology, climatology, biogeochemistry etc. (Coplen et al. 2000). The stable isotope composition of hydrogen and oxygen is conventionally expressed as δ values (δ2H and δ18O respectively) reported in units per mille (‰) (Coplen 1994). The stable isotopic composition of hydrogen and oxygen in precipitation (δp) provides an insight into the origin of water vapor, the conditions prevailing during condensation and precipitation (Aggarwal et al. 2016; Dansgaard 1964). Exploiting these variations, water (precipitation) stable isotopes have become important natural tracers in the study of the water cycle (Bowen and Good 2015; Fórizs 2003). With the continuous advancement in effectiveness and availability of analytical tools, the spatiotemporal abundance of precipitation stable isotope measurements is steadily increasing (Wassenaar et al. 2021; Yoshimura 2015), providing sufficient ground for the development of spatially continuous datasets of δp variability. These datasets can be utilized for advanced hydrological applications of precipitation stable isotopes where additional information can be gained from not only having point data, but spatially continuous information. Such applications can be found in hydrogeology (Bowen and Good 2015; Clark and Fritz 1997), limnology (Birkel et al. 2018; Nan et al. 2019), water resource management (Bowen and Good 2015; Gibson and Edwards 2002), the exploration of changes in moisture source conditions (Amundson et al. 1996), animal migration studies (Hobson 1999; Hobson and Wassenaar 1996), food source traceability (Heaton et al. 2008), and forensic sciences (Ehleringer et al. 2008) as well.

In generating spatially continuous datasets of isotopic variability, traditional kriging interpolation was demonstrated to outperform other approaches right from its inception (Bowen and Revenaugh 2003) and has since become the gold standard for mapping the spatial variability of precipitation isotopes on scales ranging from the global (e.g. Bowen 2010; Terzer et al. 2013) to the regional; e.g. Hatvani et al. (2020), Hatvani et al. (2017), Kaseke et al. (2016), Kern et al. (2014). In recent years, machine learning (ML) approaches (artificial neural networks, support vector machine and similar techniques) have increasingly proven their worth in extracting patterns and insights from the ever-increasing stream of geospatial data (Gopal 2016; Szatmári and Pásztor 2019), and have opened new perspectives on the understanding of data-driven Earth system science problems (Reichstein et al. 2019). Although the number of ML applications employed in deriving isoscapes is still scarce, their number is steadily increasing in fields as diverse as studies on bioavailable strontium (Bataille et al. 2020; Funck et al. 2021), on sulfur in human remains (Bataille et al. 2021), the stable isotope composition of nitrogen and carbon in particulate matter from the Northwest Atlantic Continental Shelf (Oczkowski et al. 2016), and studies on the isotopic composition of shallow groundwater (Stahl et al. 2020).

There have been comparative studies on the performance of different techniques in relation to predicting and mapping geochemical parameters (including stable isotopes), a task in which ML turned out to perform as well as combined geostatistical tools (Hengl et al. 2018), or even better than them (Bataille et al. 2018; Li et al. 2011). Specifically, a methodological comparison between a regression kriging approach and Random forest (RF) algorithm using daily and monthly δp datasets from European subregions showed that the ML approach is capable of outperforming the traditional methodology (Erdélyi et al. 2023). The first ever ML algorithm used in estimating the spatial variability of monthly δp in Europe was released recently (Nelson et al. 2021), employing extreme gradient boosting (XGBoost) algorithm called PisoAI that uses the Global Network for Isotopes in Precipitation (GNIP (IAEA 2019)) dataset. The performance of the different ML methods in predicting a spatially continuous dataset of monthly δp has not yet been compared. In addition, it is clear that there is still much to be learned about the behavior of ML tools in estimating the spatial variability of precipitation stable isotopes, and the difficulties and problems originating from the different spatiotemporal resolutions and predictors.

The aim of the present work is to explore the stable isotopic variability of precipitation across the European continent with the use of three variants of random forest algorithms. Specifically, to compare the performance of random forest approaches with the latest ML derived precipitation δ18O isoscape for Europe. In this task, not only the GNIP, but other networks’ data is used increasing the volume of the data by ~ 33%; which is in contrast to the usual approach in subcontinental precipitation stable isotope modeling.

2 Materials and methods

2.1 Data used

2.1.1 Precipitation isotopic data and preprocessing

Monthly δp values were acquired from a total of 480 monitoring stations operating in Europe and its vicinity between 1960 and 2021 (Fig. 1), primarily from the Global Network of Isotopes in Precipitation (GNIP), extended by the inclusion of 161 stations from regional networks and individual records; see Electronic supplementary material for details and the corresponding references. In general, the data gathered were unequally distributed in space (Fig. 1A) and time (Fig. 1B,C). Until the mid-1970s, the number of active precipitation isotope monitoring stations in Europe was limited, and those primarily belonged to the GNIP (n < 20). Afterwards, their number rose progressively, owing to the growth of the Austrian Network of Isotopes in Precipitation (ANIP) (Kralik et al. 2003), reaching about 80 stations for oxygen- (δ18O) (Fig. 1B), and 79 stations for hydrogen (δ2H) (Fig. 1C) precipitation stable isotopes by the turn of the millennium. A period with a relatively high abundance of active stations was observed in the early 2000s. For instance, in 2001 the number of active stations peaked at around 155 stations for δ18O and 137 for δ2H (Fig. 1B,C). However, the spatial distribution of the monitoring sites does not provide representative coverage of the continent in this period, being rather heavily concentrated in the Mediterranean region, owing to the coordinated research project of IAEA (IAEA 2005). Therefore, to maximize the spatiotemporal coverage of the δp values available from the study region, a focus period of 2008–2017 was chosen. This period was characterized by an annual average number of stations ≥ 125 providing data in any given year (Fig. 1B,C).

Fig. 1
figure 1

Study area, monitoring sites and available data. Map of Köppen climate zones (Kottek 2006) and precipitation monitoring sites (dots) of the region studied (A). Number of δp records (δ18O (B) δ2H (C)) obtained from the GNIP stations and other data sources (see Sect. 2.1.1) for the period 1957–2020. The annual average number of active stations for δp are marked by the green line in panels B and C. The red rectangle delimits the focus period (2008–2017); see Sect. 2.3.1. GNIP: Global Network for Isotopes in Precipitation, ANIP: Austrian Network for Isotopes in Precipitation. The metadata of the stations from ‘other’ sources can be found in Supplementary Table S1

Since δ18O and δ2H are highly correlated in meteoric waters (Craig 1961; Rozanski et al. 1993) and the δ2H data are fewer in the collected dataset (Fig. 1B,C), precipitation δ18O (δ18Op) was used throughout the experiments presented here; the results, however, should be valid for both. If stations from different data sources (e.g. GNIP, ANIP, literature) were situated close to each other (within 3 km) and had parallel measurements with a difference of only < 0.05‰ for a given month, the GNIP data were discarded. For data preprocessing, local Moran statistics and deuterium-excess were considered in a way similar to Erdélyi et al. (2023). In the screening process no obvious outliers were found, therefore, no data were excluded.

2.1.2 Potential predictors of spatial variability of precipitation stable isotopes across Europe

Various environmental parameters are known to control the spatial variability of δp on continental scales (Dansgaard 1964; Rozanski et al. 1993). Physical factors affecting the transport and fractionation processes within the global hydrologic cycle that can be represented by the geographical position of a given location are known to be important drivers of δp (Bowen and Revenaugh 2003; Dansgaard 1964; Rozanski et al. 1993). In the present study latitude, longitude are in WGS84 coordinate system (EPSG 4326), elevation derived from the digital elevation model obtained from Amazon Terrain Tiles (zoom parameter 5: ground resolution ~ 3 km in average)(AWS 2021) were used. These geographical variables are, however, mere reflections of the actual drivers regulating the atmospheric mixing and fractionating processes, and these are, after all, environmental mechanisms which can be considered via physical/meteorological parameters. A wide variety of meteorological parameters influencing δp spatial variability were considered: vapor pressure, monthly mean temperature, monthly average of the diurnal temperature range, monthly precipitation amount, all of which were derived from the CRU TS4.05 0.5 × 0.5° resolution dataset (Harris et al. 2020), and saturation vapor pressure calculated (Murray 1966) from the CRU monthly mean temperature.

In addition, the distance matrix of the monitoring stations providing the training data, following the procedure of Hengl et al. (2018) were also included as a possible predictor. Specifically, the set of distances of a station from the others will serve as a predictor (called buffer distance (Hengl et al. 2018)) instead of its spatial coordinates (latitude and longitude), and this will be done for all the stations. Categorical variables were also included as the Köppen-Geiger climate region’s codes (KG; resolution 0.5 × 0.5°) (Kottek 2006), which reflect the climatic effects in a complex and aggregated way and have been used as such in other studies—e.g. (Heydarizad et al. 2021; Nelson et al. 2021)—and the month and year of a given measurement to reflect the well-known seasonality of δp (Feng et al. 2009). The geographical and meteorological predictors had been extracted to the grid cells locations using nearest neighbour resampling technique.

2.2 Methodology

2.2.1 Random forest

To predict the spatial variability of δp across Europe, random forest (RF) was used. Random forest is a non-parametric ML algorithm which assesses a combination of predictors, and in which each tree depends on the values of a random vector sampled independently, with the same distribution for all trees in the forest (Breiman 2001; Cutler et al. 2012). The generalization error of a forest is dependent on the strength of the individual regression trees in a forest, it converges to a limit as the number of trees in the forest increases (Breiman 2001). The forest’s predictions are based on the average results of the decision trees, which use bootstrap sampling to decrease the possibility of over-fitting; for details see e.g. Biau and Scornet (2016); Breiman (2001); Prasad et al. (2006). The approach was chosen, since it has been successfully applied to map isotopic parameters (e.g. Bataille et al. (2018)) and other environmental variables (Hengl et al. 2018) or predicting the biological status of water bodies (Szomolányi and Clement 2023).

Two alternative versions of random forest were employed in the study regarding the set of predictors for the focus period (2008–2017). In the RF approach, all predictors were included except for the distances between the grid cells and the monitoring stations, while in the so-called RFsp (Hengl et al. 2018) the distances of the monitoring stations from the grid cells were also used as predictors. For these two models (RF and RFsp), random forest models were calculated separately for each monthly dataset, and those models were used to make the predictions.

In the case of another variant of random forest, the whole dataset (all monthly data together from 2008 to 2017) is used, with the years and months as additional categorical predictors, as the training data from which to derive the model. Following this, the monthly isoscapes are predicted (marked as RFtg). In the meanwhile, in the case of both RF and the RFsp, the training and prediction is performed on each and every monthly dataset separately. In all random forest approaches the so-called ‘variable importance’ sequence is obtained, representing the relative importance of each predictor in a random forest model.

In the present case the randomForestSRC (Ishwaran et al. 2021) and ranger (Wright and Ziegler 2015) R (R Core Team 2019) packages were used to obtain the RF and RFsp models, respectively. The specific parameter values for e.g., nodesize and mtry were determined using the tuneRanger() function following Probst et al. (2019) using the out-of-bag observations to evaluate the trained algorithm (Breiman 1996).

Evaluation of our model results consisted of out-of-sample verification, and comparison to another ML-based spatial prediction of δ18Op variability across Europe (Nelson et al. 2021). To see the possible effect of a larger dataset in space, training was performed on (i) the GNIP data only—overlapping with the only available ML alternative for δp predictions in Europe (Nelson et al. 2021)—and (ii) the entire collected dataset (Fig. 1A, B).

When evaluating the importance of the predictors used, variable importance plots were derived using mean decrease in impurity (Gini index; Breiman (2017)), which shows each variable’s selection frequency for a split during the building of a tree (Szomolányi and Clement 2023). The variable importance plots were plotted in such a way that their information would be rendered comparable to the various random forest approaches. Therefore, all the weights were converted to percentages, and only the first 11 predictors’ explanatory power was evaluated (Sect. 3, Fig. 2).

2.2.2 Validation of the random forest monthly models of δ p variability

For validation purposes, 10% of the available data (1582 cases) were retained from each month with Basel and Rovaniemi (Figs. 3 and 4) always included, since these two were used for validation in the development of the PisoAI model (Nelson et al. 2021). This semi-random selection resulted in 14 validation stations overall (Figs. 3 and 4). It was ensured that the chosen test stations would not be closer to each other than 60 km, to avoid clustering of the validation points, and required that the amount of missing monthly data was below 25% of the total time considered (i.e., between 2008 and 2017).

In the validation process numerous statistics were used. Absolute differences were calculated to explore the performance of the random forest approaches and compare it to actual measured δp values and the PisoAI model. In all cases these were obtained as the difference of the measured value of a validation station and its nearest grid cells values to where the different ML models predicted their results. The median of the absolute differences (MAD) and the mean absolute errors (MAE) were calculated to compare the model performances. MAE measures the average absolute difference between the predicted values and the actual values in a dataset. It is a commonly used error statistic instead of, for example, root mean square error, when the error distribution is not expected to be Gaussian (Chai and Draxler 2014), as in the present case (W(1529) = 0.882, p < 0.001;(Shapiro and Wilk 1965)). Also, MAE is suggested to avoid the amplification of the higher errors compared to the lower ones (Willmott and Matsuura 2005).

Additionally, the Lin's concordance correlation coefficient (Lin’s CCC, Lawrence 1989) a hybrid error metric was included. The CCC combines measures of both precision and accuracy to determine how far the predicted and measured data pairs deviate from the line of perfect concordance (y = x line). Lin’s CCC ranges from − 1 to 1, with a perfect concordance at 1. Lin’s CCC was calculated using the CCC() function of the DescTools package, (Signorell et al. 2019).

3 Results and discussion

First, the importance of the applied predictors is shown and evaluated, followed by the assessment of the performance in time. As a last step, the overall performance of the different models is compared and discussed.

3.1 Importance of predictors

When a random forest model is built, out-of-bag (OOB) samples that are not used during the construction are attributed to all the trees of the model. These OOB samples are then used to calculate the importance of a specific variable, in the present case, this means to get an insight into the controlling mechanisms of the spatial variability of precipitation stable isotopes. There is no change in the order of the predictor importance among the most powerful ones, and there is only a slight change in the lower importance ranks (Fig. 2). Only the power of KG variable drops explicitly when the amount of data is increased (Fig. 2A–C vs. D–F).

Fig. 2
figure 2

Variable importance plots of the different Random Forest models: RF (A), RFtg (B), and RFsp (C) variants using only the GNIP data, and then all data, in panels (D), (E), and (F) respectively. For the RF and RFsp variants (A, C, D, F), the results are the average of 120 individual variable importance results derived from the monthly random forest models. In panels C and F, the predictors S_434, S_435 and S_438 are station codes, referring to the buffer distances which were found to be of particular importance. For the explanation of the method abbreviations, see Sect. 2.2.1

In the present case, the most important predictors were the meteorological variables (vapor pressure, saturation vapor pressure, and temperature) in the complex random forest models (RFtg and RFsp), as has been shown in the case of the PisoAI model using the XGBoost algorithm (Nelson et al. 2021). Diurnal temperature range, however, had the weakest predictive power in every case. This was true regardless of whether only the GNIP data (Fig. 2A–C) or all the available data (Fig. 2D–F) were used.

The most striking difference between the more powerful and less powerful predictors is seen in the case of the RFtg model (Fig. 2B, E), while this is (considerably) less so in the case of the classical RF model. Of the geographical variables, latitude was in every model among the most important drivers (Fig. 2). In the RF model it was the leading one (Fig. 2A,D), concurring with past studies in which latitude was the crucial predictor (Bowen and Revenaugh 2003) or was again amongst the most important ones (Hatvani et al. 2017; Liu et al. 2008; Terzer-Wassmuth et al. 2021; Terzer et al. 2013). In the RFsp model, the fourth most important predictor is S_438, which is one of the training station’s (Kuopio) distance to the grid points. There is no change in the order of the predictor importance among the most powerful ones, and there is only a slight change in the ranks of lower importance (Fig. 2).

3.2 Performances evaluated in time

The results unanimously indicate that all random forest variants are capable of reproducing the overall trends and seasonal patterns of measured precipitation stable isotope variability at each validation site across Europe. This—namely, capturing the amplitude of seasonal variations in δ18Op—however, had been a continuous challenge for previous regression based geospatial models, e.g. Cluett and Thomas (2020), Daniels et al. (2017), Shi et al. (2020), and in this respect ML approaches seem to outperform their regression-based counterparts (Nelson et al. 2021).

The predicted time series of δ18Op values from the validation sites (Fig. 3) showed that the random forest models generally tracked the seasonal variability of the measured values well in almost every case. A relatively small variance in δ18Op is seen at the coastal locations (Fig. 3A,B; for localities see Fig. 4C), while more pronounced seasonal cycles prevail inland (Fig. 3C–N; for localities see Fig. 4C). There is no single model that could be considered to outperform the others at most of the stations.

Fig. 3
figure 3

Time series of observed and predicted monthly δ18Op records from January 2008 to December 2017 at 14 locations (AN) across Europe. The observed values are shown with a black line, while predicted ones are colored according to the legend. Station names belonging to the GNIP network are capitalized. The different predictions r2 values to the measurements can be found next to the station names

Fig. 4
figure 4

Absolute differences in the measured and predicted values of monthly δ18Op using the different ML methods on data from January 2008 to December 2017 at 14 validation stations when only the GNIP data are included (A), and when all data has been used (B). The boxes indicate the interquartile intervals. Two upright lines represent the data within the 1.5 interquartile range, the horizontal line inside the box represents the median value (Kovács et al. 2012). The black dots represent the difference values in a horizontally random positioning view to avoid overlap. The values below the boxplots represent the median of the absolute differences in descending order. The boxplots and median values are colored according to the legend. (C) Validation stations are marked on the map by the initials typed in boldface on the axis title of panel B. Validation stations names belonging to the GNIP network are capitalized

Generally, the r2 values ranged from 0.26 (Cestas—Fig. 3B) to 0.87 (Basel—Fig. 3F). Spatial clustering can be observed in the case of the linear relationships between the validation stations’ time series and the predicted ones. The weakest correlations (0.26 < r2 < 0.7; median r2 = 0.5) were seen in the case of the SW stations (A Coruña, Cestas, Tortosa, Girona; Fig. 3A–D), while higher ones (0.66 < r2 < 0.87; median r2 = 0.78) characterized the Central European validation stations (Fig. 3E–L). Somewhat weaker r2 values were recorded for the northernmost validation stations (Fig. 3M,N) ranging between 0.64 and 0.74 with a median of 0.69. When comparing the r2 values between the different models, it becomes clear that in the case of the SW validation stations the PisoAI produced the best estimations (Fig. 3A–D). Regarding the Central European stations, the RFsp gave the best predictions in four- and the PisoAI in three out of the eight cases (Fig. 3E–L). For the northernmost stations: at Espoo the PisoAI (r2 = 0.85; Fig. 3M) while at Rovaniemi the RF approach gave the highest correlations (r2 = 0.68; Fig. 3N).

Intra-annual/seasonal variability is generally well reproduced, for example, even if a winter month is less negative than preceding and following one(s), the models were in fact able to mimic the sub-seasonal pattern, e.g., 2012 Nyon, 2016/2017 Espoo. Similar accuracy was seen in case of even smaller variations in the summer months of 2014 and 2015 at St. Gallen, as well, for example.

Regarding the interannual/interseasonal differences, if a winter was more negative than the ones generally occurring at a particular station, the models still gave quite accurate predictions. For example, in Nyon, in the winter of 2014/2015, the relative difference between the δ18Op negative peaks (near − 5‰) was well reproduced compared to the following winter season, as also in the case of the Carpathian monitoring station (Liptovsky Mikulas; Fig. 3L). There, in the Decembers of 2010 and 2012, δ18Op < − 18‰, while in early winter 2016 neither the observed, nor did the modelled δ18Op drop below − 14‰, all of which again highlights the estimation power of the ML tools for δ18Op. A summer example can be drawn from Basel and Graz, where in 2010 and 2013 the summer peak was below − 5‰, while in 2015 it was above − 5‰, and the models were able to reproduce the interseasonal differences.

Despite the general and reasonable agreement between the observed and modelled time series of δ18Op, deviations are also observable. These will be elaborated on in the following order: (i) systematic misfits, (ii) multimothly-periods, (iii) monthly anomalies/single peaks.

The most obvious systematic misfit can be observed in summer at the northern validation stations (Fig. 3M, N). The Random Forest algorithms predicted more positive values (the largest errors > 4‰) for the summer month of 2009. The degree of misfit seems to decrease from 2014 (Fig. 3M, N). The improvement of the predictions, especially at Espoo, may be supposed to be a benefit of the entering of data from closer locations to the training set. Since the contribution to the GNIP database from two Estonian stations (Tartu, Vilsandi) started in 2013. Another season-specific bias can be seen in A Coruña (Fig. 3A), where the Random Forest prediction underestimates the observed δ18Op values. The largest errors, > 7‰, appear for the summer months of 2011 and 2015.

Another type of problem is when multimothly-periods with generally poor estimations occur. For example, in Cestas from late 2013 to early 2014, all models fail to match the measured data (Fig. 3B). However, the measured data itself displays an unusual, weird pattern, showing values for winter which are less negative and ones for summer which are more negative, contradicting the expected seasonality for a midlatitude location (Feng et al. 2009; Rozanski et al. 1993) and documented for nearby French stations (Zhang et al. 2020) even in this particular interval. This raises concerns about the reliability of the Cestas δ18Op record for this time interval. Another example is Graz (Fig. 3K), where from mid-2011 to early 2012, the models were unable to match the repeatedly occurring high values (> − 5‰) in winter, which were higher than the typical level of the summer δ18Op records at this site (Fig. 3K).

Although the intra-seasonal pattern was in general well reflected in the model data (see above), still, occasionally occurring cases of monthly anomalies/single peaks are seen in which the amplitude of the prediction(s) fall behind the measurements. These typically occur in cases where negative values in winter are involved, for example, in February 2010 in Rovaniemi (Fig. 3N), November 2013 in Tortosa (Fig. 3C), and November 2015 in Graz (Fig. 3K).

3.3 Overall performances

Since the empirical inspection of the δ18Op time series did not reveal explicit differences between the models’ ability to reproduce the actual observations, the monthly absolute differences were considered together at each validation stations and plotted on box-and-whiskers plots for further analysis (Fig. 4A,B). This was done for the cases when only the GNIP data (Fig. 4A) and then all data were used in the training procedure (Fig. 4B); also see Sect. 2.2.

In case of the RFsp, (i) the median value of the absolute differences was smallest in 6 out of 14 cases considering only the GNIP data for training (Fig. 4A), and 7 out of 14 times considering all data (Fig. 4B). Meanwhile, (ii) the interquartile ranges for this particular model were smallest 4 out of 14 times with only GNIP data (Fig. 4A), and 9 out of 14 times with all data (Fig. 4B), and (iii) the upper fence maximum value was the lowest 6 out of 14 times with only GNIP data (Fig. 4A) and 9 out of 14 times with all data (Fig. 4B).

Comparing this performance to the only alternative yet to be published in the specific field of precipitation isotope composition prediction (PisoAI, Nelson et al. 2021), (i) the median value of absolute differences was the smallest 8 out of 14 times with only GNIP data (Fig. 4A), and 6 out of 14 times when all the stations were considered (Fig. 4B). (ii) The interquartile ranges were the smallest 8 out of 14 times with only GNIP data (Fig. 4A), and 5 out of 14 times with all data (Fig. 4B). Lastly, (iii) the upper fence was the lowest 6 out of 14 times with only GNIP data (Fig. 4A, and 5 out of 14 times with all data (Fig. 4B). When not only the GNIP, but all the stations available for the present study (Fig. 1) were considered, RFsp proved to provide more accurate predictions in 7 out of 14 cases (Fig. 4B). It has to be noted that in this case random forest models were trained on a larger set compared to PisoAI (Nelson et al. 2021), which was trained exclusively on GNIP data.

The worst median values were produced by the RF model when dealing with only GNIP (Fig. 4A), and the RFtg model when all data were used (Fig. 4B). There is no clear indication that the Random Forest models created by merging all the monthly datasets (RFtg) would produce better predictions than the versions calculated for every month. Among the validation stations, the most striking deviations were seen at Rovaniemi and A Coruña primarily in summer (Fig. 3A,N and Sect. 3.1). These stations are situated at a marginal position with a relative scarcity nearby training stations (Figs. 1A, 4C). Thus, without an abundance of similar training stations, the random forest algorithm struggles reproducing the empirical degree of the latitudinal effect (Dansgaard 1964).

In general, with the inclusion of additional sites in the training set compared to the GNIP data, the predictive accuracy of the random forest approaches increased in 12, 7 and 10 cases out of the 14 for the RF, RFtg and RFsp variants, respectively (Fig. 4A vs. 4B: MAE values). The biggest improvement was achieved at the validation stations located in the eastern part of the Alps (Kufstein, Wildalpen and Graz; Fig. 4C), where in fact the presence of the ANIP network (Fig. 1A: red dots) increased the total abundance to the greatest degree (Fig. 1B, C). The difference in predictive accuracy (MAD) ranged from 0.73 to 2.87‰; Fig. 4B vs. A.

No comparable improvement in MAD was observed for the validation stations in the western Alpine Region (Nyon, Basel, St. Gallen). A characteristic example is the Basel station with the addition of the ANIP data; still, the MAD increased to between 0.06 and 0.16 ‰ for all random forest variants (Fig. 4A, B). This region neighbors the ANIP network (Figs. 1 and 4C), thus some degree of improvement might have been expected. A similar situation is observed at the northernmost validation stations, where the most promising random forest variant gave poorer estimations (Rovaniemi and Espoo) when the model was trained on an extended dataset compared to the GNIP version (Figs. 4A, B).

To assess the overall performance—regardless of the spatial location(s)—of the tested ML models, the MAD, MAE and Lin’s concordance correlation coefficient values were calculated for the 14 validation stations were averaged (Table 1). In this respect, the PisoAI (MAD = 1.113‰) performed better by 0.2‰ than the most promising random forest variant (RFsp—MAD = 1.299‰), when only the GNIP data were considered; with the complete database, the RFsp method gave nearly as low MAD value (1.124‰). In case of the MAE values, RFsp resulted the lowest value (1.345‰) when all the data were included for the analysis. The highest score for the Lin’s CCC was provided by the RFsp variant when only the GNIP network data considered (Lin’s CCC RFsp GNIP only = 0.986), and further improved when additional stations (all data) were involved to the training set (Lin’s CCC RFsp all data = 0.987). Those results corroborates the previous findings, namely, that with an extended dataset the RFsp approach provides more accurate results than the PisoAI, but when only the GNIP database is used the performance of the methods remains comparable (Figs. 3 and 4; Table 1).

Table 1 Various overall error metrics obtained in the validation of the predicted products with the different methods

4 Conclusions and outlook

In the study the performance of four machine learning algorithms were compared on monthly δ18Op records from 2008 to 2017. The methods were random forest (RF), random forest employing buffer distance (RFsp), a random forest model trained using all data employing month and year as categorical variables (RFtg), and lastly, the results of a previously published model employing XGBoost (Nelson et al. 2021). The accuracy of the predictions was tested on all the available data and a subset consisting of only GNIP data. Using only the GNIP data, similar results were obtained regardless of which random forest variant was employed, but these slightly underperformed compared to the XGBoost approach. However, the inclusion of additional network data in the training set further improved the results for all random forest models (Table 1). In this respect, the model employing buffer distance RFsp gave the smallest overall error amongst the random forest variants used; indeed, it was even better than that provided by the XGBoost. The median value of the absolute differences, their interquartile ranges and upper fence maximum values were all lower for a higher number of validation stations in the case of the RFsp model than in the case of the PisoAI approach (Fig. 4A, B). The three leading predictors were vapor pressure, saturation vapor pressure, and temperature in the complex random forest models (RFtg and RFsp), a finding similar to that obtained using the PisoAI model with the XGBoost algorithm (Nelson et al. 2021). It was an obvious, yet important, observation that subregions represented by only a few stations (especially in northern Europe) benefit from the inclusion of additional data in the training dataset.

It is suggested that the next step in the research should be to increase the dataset by an even greater extent and perform comprehensive tests using the best performing methods, complemented by the XGBoost algorithm (Nelson et al. 2021) and random forest combined with Lasso regression (Erdélyi et al. 2023) which allows extrapolation outside the data boundaries (Zhang et al. 2019).

Gaining an in-depth knowledge of how machine learning methods behave under different circumstances (spatiotemporal data density, various predictors etc.) in predicting precipitation stable isotopes on continental scales is a necessary step. It is our expectation that in the near future, classical geostatistical approaches will be overtaken by geomathematical ML tools in the derivation of the next generation of (precipitation) isoscapes.