1 Introduction

Kilometre-scale simulations have been increasingly used to construct an improved representation of local weather and climate (Hohenegger et al. 2009; Bauer et al. 2011; Froidevaux et al. 2014; Imamovic et al. 2017; Kirshbaum et al. 2018). In recent years, several studies have shown the benefits of using regional climate models (RCMs) at spatial resolutions of a few kilometres, often called convection-permitting or kilometre-scale simulations (Warrach-Sagi et al. 2013; Prein et al. 2015; Schär et al. 2020; Coppola et al. 2020; Ban et al. 2021; Pichelli et al. 2021). At these scales, models are expected to explicitly resolve deep convection, without using error-prone deep-convection parametrisations (Emanuel and Zivkovic-Rothman 1999) and thus to better represent convection, precipitation and related processes (Prein et al. 2017; Berthou et al. 2018; Fumière et al. 2020; Knist et al. 2020; Schär et al. 2020; Schwitalla et al. 2020; Caillaud et al. 2021), as well as other key variables, such as wind and temperature (Ban et al. 2014; Belušić et al. 2018; Schwitalla et al. 2020) and land–atmosphere interactions (Mooney et al. 2020a, b; Knist et al. 2020; Barlage et al. 2021). The rapid growth of available computational power has allowed the emergence of numerous kilometre-scale experiments (CPRCMs hereafter), spanning ever larger domains and longer periods (Prein et al. 2015; Leutwyler et al. 2017; Schär et al. 2020; Schwitalla et al. 2020; Ban et al. 2021; Pichelli et al. 2021). However, an increase in model horizontal resolution does not necessarily add value in the sense of a better representation of reality by such simulations. The experimental setup (nesting strategy), the boundary forcing (spatial–temporal resolution), the complexity of the region of interest (land-sea contrast, orography), the coverage and quality of observations, interactions between remaining parametrisations (Dirmeyer et al. 2012; Torma et al. 2015; Prein et al.
2015), and the temporal scale (Kotlarski et al. 2014) are also highly relevant factors when ascertaining added value. EURO-CORDEX, the European domain of the Coordinated Regional Climate Downscaling Experiment (CORDEX; Giorgi et al. 2009; Gutowski et al. 2016; Jacob et al. 2020), with its ensembles at two horizontal resolutions (0.44° and 0.11°), provided a wealth of simulations enabling extensive analysis of different variables at various temporal and spatial scales (e.g., Vautard et al. 2013; Kotlarski et al. 2014; Jacob et al. 2014; Katragkou et al. 2015; Casanueva et al. 2016; Prein and Gobiet 2017; Cardoso et al. 2016, 2019; Knist et al. 2017; Soares et al. 2017; Terzago et al. 2017; Frei et al. 2018). Kotlarski et al. (2014) found that spatial averaging over large areas along with temporal averaging to seasonal scales cancels any added value from the increase in resolution. While the increase in resolution improved the representation of the extreme temperatures in coastal and mountain areas, no overarching benefit was determined by Vautard et al. (2013) in their EURO-CORDEX analysis of mean temperature above the 90th percentile. Nevertheless, Casanueva et al. (2016) and Prein and Gobiet (2017) agree that increasing the resolution decreases precipitation biases and improves the spatial distribution of precipitation.

Within CORDEX, several Flagship Pilot Studies (FPS) were approved. One is devoted to the study of “Convective phenomena at high resolution over Europe and the Mediterranean” (FPS-Convection) and is promoted by both EURO-CORDEX and Med-CORDEX (Ruti et al. 2016). FPS-Convection aimed at building the first multi-model ensemble of kilometre-scale simulations over an Alpine domain (Coppola et al. 2020) and embracing the scientific challenge of a rigorous and quantitative assessment of the added value of regional downscaling at kilometre-scale over Central Europe. A set of hindcast simulations, forced by the ERA-Interim (Dee et al. 2011) reanalysis, and historical and future climate scenario simulations were performed, spanning time slices of at least 10 years each. These simulations were focused on the Alps and surrounding areas due to their orographic complexity, precipitation extremes, and exposure of nearby Mediterranean areas to other types of climate extremes such as heat waves and droughts. The region is also well-covered with high-density, high-resolution observational data. Almost all participating modelling groups adopted a nesting downscaling strategy, composed of a European domain at either 12, 15 or 25 km resolution, and an Alpine domain at 2–3 km spatial resolution. The hindcast simulations of the FPS-Convection were evaluated by Ban et al. (2021) for precipitation. The results showed that CPRCMs generate a more realistic representation of precipitation spatial distributions, intensities, and diurnal cycles than the respective coarse resolution RCMs, while the most significant improvements are found for heavy precipitation and precipitation frequency on both daily and hourly time scales in the summer season. These advances are linked to improvements in orographic features or even interactions of thermodynamically induced convective activity with orography. The assessment of climate change effects on precipitation by Pichelli et al. 
(2021) showed an enhancement of the projected changes at convection-permitting scales compared to the coarser resolution models and changes in the amplitude of the summer diurnal cycle, frequency, and intensity of the precipitation. However, the added value for temperature has not yet been objectively assessed.

Added value assessments provide some indication of the areas where a higher resolution model improves agreement with observations for variables or processes of interest compared to its coarser resolution counterpart. These kinds of studies assess the skill of an RCM/CPRCM relative to the respective driving data in the representation of a variable or process (Bärring and Laprise 2005; Rockel et al. 2010; Di Luca et al. 2012, 2013). In this context, added value is the improvement of a given statistic obtained at the higher resolution, in contrast to the driving simulation. Following Di Luca et al. (2012), there are two sources of added value: the first can be attributed to processes which are parameterized, i.e., not explicitly resolved by the large-scale forcing, and the second is attributable to the scales which are of the same magnitude as the forcing simulation. Thus, higher-resolution RCMs/CPRCMs are expected to add value at small spatial and temporal scales. Yet, this is not consistent among different variables, as topography, the driving simulation and the statistics considered in the assessment all have an impact on the quantification of the added value (Prömmel et al. 2010; Feser et al. 2011; Di Luca et al. 2012, 2013, 2016; Vautard et al. 2013; Prein and Gobiet 2017; Soares and Cardoso 2018; Cardoso and Soares 2022). This information is then useful for the development of downstream impact assessments, climate services and model improvement. There are many ways to assess added value. However, by itself, an added value assessment is not sufficient to determine how well a simulation represents reality (Giorgi et al. 1994, 2016; Kanamitsu and Kanamaru 2007; Kanamitsu and DeHaan 2011; Di Luca et al. 2012, 2013, 2016; Torma et al. 2015; Lucas-Picher et al. 2017; Soares and Cardoso 2018; Fantini et al. 2018; Cardoso and Soares 2022; Ciarlo et al. 2020; Careto et al. 2022a, b).

To quantify the added value of high-resolution simulations with respect to the coarser resolution parent domains or forcing simulations, Soares and Cardoso (2018) developed a metric based on the ability of models to reproduce the observed probability density functions (PDF). This metric may be applied to the full PDF or to PDF sections only, e.g., those related to extremes. These distributional added value (DAV) analyses were previously used to characterize the added value of precipitation over Europe in the EURO-CORDEX hindcast climate simulations at 0.11° and 0.44° resolutions (Soares and Cardoso 2018), in comparison with the European Climate Assessment & Dataset (ECA&D, http://www.ECAD.eu) v11 local weather stations. These analyses demonstrate the gains related to the use of a higher resolution RCM to capture precipitation, when compared with ERA-Interim, namely for precipitation extremes. Soares and Cardoso (2018) also showed that the added value associated with the increase in resolution within RCM resolutions is model, season and region dependent. Cardoso and Soares (2022) applied the same methodology to EURO-CORDEX maximum and minimum temperatures. The assimilation of temperature observations in ERA-Interim complicates the interpretation of the results. Not all models displayed added value through the increase in resolution and the results were, in general, season and region dependent. Although no added value is found for the Alpine region, the increase of resolution from 50 to 12 km improved the simulated DAV results. The lack of added value in the Alps was mostly associated with the higher altitudes, due to the difficulty of the models in correctly simulating snow accretion and the snow-albedo-temperature feedbacks, as well as temperature inversions and cold drainage in the valleys. Another example is the use of this metric in the Iberian Peninsula considering all available EURO-CORDEX simulations at 12 km resolution in Careto et al. (2022a) for precipitation and Careto et al.
(2022b) for maximum and minimum temperature. Again, the added value is clearer for precipitation, and, as in Cardoso and Soares (2022), near coastal sites for temperature. Another example of the successful application of the DAV in the Iberian Peninsula is by Molina et al. (2022) for wind speed. The authors consider the EURO-CORDEX hindcast and historical simulations at 12 km and 50 km resolution and find improvements with increasing resolution, particularly for wind extremes. Di Luca et al. (2013) also highlight that added value in temperature downscaling is related to the improved representation of local atmospheric circulations such as sea breezes and small-scale topography. An added-value analysis of the impact of increased resolution on near-surface temperature downscaling for RCMs/CPRCMs is thus imperative.

In this study, the DAV metric is applied to near-surface temperature to assess the added value of the FPS-Convection CPRCM and RCM hindcast simulations. This assessment relies on state-of-the-art observations from the recent regular gridded E-OBS v21.0e (Cornes et al. 2018), the European Meteorological Observations with a regular horizontal grid with 5 km horizontal resolution (EMO-5: Thiemig et al. 2022), and ECA&D v18 datasets (Klein Tank et al. 2002; Klok and Klein Tank 2009), for the points within the high-resolution domain. The main research questions are: (1) what is the added value of dynamically downscaling ERA-Interim using RCMs down to convection-permitting scales? (2) Since convection-permitting resolutions approach local scales, what is the added value in considering local stations as a reference? The full PDFs of both maximum and minimum temperatures are analysed as well as their extreme tails. In this way, we quantify the added value of kilometre-scale runs focusing on the Alps, for temperatures and their extremes. The following section introduces the data and describes the method used to diagnose added value. Results are presented in Sect. 3. Finally, the main conclusions are given in Sect. 4.

2 Data and methods

2.1 FPSCONV simulations

In this work, the added value of dynamical downscaling of the 0.7° (~ 75 km) resolution of ERA-Interim, at 12–25 km and 2–3 km, was assessed. An ensemble of twenty-one simulations using eight different regional climate models was used (Table 1). Modelling groups using the same model, such as the Weather Research and Forecasting (WRF) model (Skamarock et al. 2008), coordinated a different physics configuration for each version. The experiment set a lower-resolution pan-European domain encompassing the entire EURO-CORDEX region for most of the groups (Fig. 1a), and a nested higher-resolution domain covering a region over the Alps (Fig. 1b). Depending on the model, slightly different resolutions are used in both domains, from 12 to 25 km for the outer domain, which covers the whole EURO-CORDEX, Med-CORDEX or an extended Alpine domain, down to 2 to 3 km for the nested grids. The exceptions are the simulations from MOHC and JLU, which did not adopt a nesting strategy and directly downscale ERA-Interim to the kilometre-scale grid. The hindcast simulations span at least 10 years, covering the 2000–2009 period, with 1999 as a spin-up year for most of the models. More details of the model configurations are shown in Table 1. Ban et al. (2021) provide additional information on each model setup.

Table 1 Short description of the FPS-Convection simulations driven by the ERA-Interim reanalysis for the RCMs covering the EURO-CORDEX, Med-CORDEX or an extended alpine domain and CPRCMs for an ALP domain (Fig. 1)
Fig. 1

a EURO-CORDEX full domain, taken from the WRF simulations with the area delimited in blue as the common domain across all datasets and where all computations were carried out. b E-OBS orography over the common ALP domain. The colour scale represents either the RCM or the E-OBS topography; each cross corresponds to a single ECA&D station for maximum temperature and each blue dot to a station for minimum temperature

2.2 Observations

For the assessment of the new simulations’ added value, the RCMs were compared against the E-OBS v21.0e regular gridded dataset (Cornes et al. 2018) at 0.1° (~ 11 km) resolution. E-OBS is a daily ensemble dataset available from the European Climate Assessment & Dataset (ECA&D; http://www.ECAD.eu) on 0.1° (~ 11 km) and 0.25° (~ 25 km) regular grids, covering the whole of Europe. The ensemble was constructed from 100 members, through a conditional simulation procedure. Each member of the ensemble uses a spatially correlated random field, which considers a pre-calculated spatial correlation function. The ensemble members consider the ECA&D station data and other non-freely available station data in their formulation. If no data are available for a particular point, E-OBS uses the SYNOP system as a replacement to provide near real-time data (for more details refer to Cornes et al. 2018). The data cover the period from 1950 to 2019, and the ensemble mean and spread are available for precipitation, daily mean, minimum and maximum temperature, as well as sea level pressure (the full ensemble is available upon request). Here, the added value evaluation was performed using the daily minimum and maximum temperatures for the common domain across datasets (outlined in blue in Fig. 1a).

Another gridded dataset is also considered in this study, the European Meteorological Observations with a regular horizontal grid with 5 km horizontal resolution (EMO-5: Thiemig et al. 2022), covering the 1990–2019 period. This dataset is built from historical and real-time observations from a large number of ground weather stations, the ERA-Interim/Land reanalysis and other high-resolution regional observational datasets. The EMO-5 includes variables such as precipitation, minimum and maximum temperature, solar radiation, wind speed and water vapour pressure at a daily time step. The authors used the SPHEREMAP interpolation procedure to convert the quality-controlled station data into a regular grid with a 5 km resolution. The authors also performed a quality check on the new dataset, which confirmed its ability to capture most precipitation events.

Besides the gridded datasets, we also considered the ECA&D weather station observations (Klein Tank et al. 2002; Klok and Klein Tank 2009), which are the basis of the E-OBS dataset. The ECA&D dataset collects daily series of observations from weather stations within Europe and is updated at regular intervals. Basic quality control of the station data is performed by the ECA&D team. Figure 1b shows all the stations used for this study. It is important to note that, aside from southern Germany, the station coverage is quite sparse. This affects the gridded product as well as the station-based product.

2.3 Distribution added values (DAVs)

Soares and Cardoso (2018) proposed the DAV metric to quantify the added value of higher versus lower-resolution simulations, using observations as a baseline. This metric relies on a probability density function (PDF) skill score proposed by Perkins et al. (2007), which measures the similarity between two PDFs. The first step is to build an empirical PDF for each dataset (RCMs, CPRCMs, E-OBS, ECA&D, EMO-5 and ERA-Interim) by considering the relative frequency of occurrence of each temperature value. Each bin is composed of the sum of all occurrences within a 0.5 °C width. We also performed a small sensitivity test on the bin width for the values of 1 °C, 0.5 °C and 0.25 °C. We found small differences between the 1 °C and the 0.5 °C bins, while the PDF was too irregular at 0.25 °C. Thus, we chose a bin width of half a degree. Then, the normalization was carried out by dividing the number of occurrences in each bin by the sum of all bins. The use of a normalized PDF metric allows for easier comparison across seasons and regions, and also for a more accurate assessment of changes in the distributions (Gutowski et al. 2007). The evaluation is then carried out for the maximum and minimum temperatures individually. Here, for both temperatures the lower and upper limits are set to − 50 °C and + 60 °C respectively, encompassing the maxima and minima of all datasets. Following Perkins et al. (2007), the matching score represents the common overlap between the model’s PDF (RCM, CPRCM or reanalysis) and the observational PDF. It is computed by determining for each bin (Z) which PDF, observations or model, has the lower value and by adding these values:

$$S = \sum\limits_{i = 1}^{n} \min\left( Z_{m_{i}} ,Z_{obs_{i}} \right)$$
(1)

where \(n\) is the total number of bins and \(m\) denotes the simulation, either the high-resolution RCM or CPRCM or the low-resolution ERA-Interim. Thus, a score was computed for each simulation. As the sum of the relative frequencies in a PDF is 1, S has a maximum value of 1 (perfect overlap) and a minimum of 0 (no overlap between the PDFs). Since the data are normalized, the contribution of each bin to the overall score of a model decreases when approaching the tails. Also, because of the normalization, if a model displays unrealistically high frequencies in one part of the PDF, it will inevitably show lower frequencies in the others, thus lowering the score. When a model does not have values within a bin but the observations do, or vice versa, the contribution from this bin is zero. Based on the scores for RCM or CPRCM (\({S}_{hr}\)) and ERA-Interim (\({S}_{lr}\)), it is then possible to compute the DAV metric as a relative difference between the two:

$$DAV=100\times \frac{{S}_{hr}-{S}_{lr}}{{S}_{lr}}$$
(2)

DAV returns the percentage of added (positive) or lost (negative) value associated with the higher resolution relative to the lower resolution. Moreover, this metric can be computed either for the full PDF or focusing only on segments of the PDF. Following Soares and Cardoso (2018), the added value for the extreme section of maximum (minimum) air temperature was also analysed considering only the values above (below) the observational 90th (10th) percentile.
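As an illustration of Eqs. (1) and (2), the following minimal Python sketch builds normalized histograms with 0.5 °C bins between − 50 °C and + 60 °C, computes the overlap score S and then the DAV. The function names (`empirical_pdf`, `perkins_score`, `dav`) and the sample temperatures are hypothetical, and skipping missing or out-of-range values is an assumption made here for simplicity; this is not the code used in the study.

```python
import math

BIN_WIDTH = 0.5              # °C, bin width chosen in the text
T_MIN, T_MAX = -50.0, 60.0   # °C, histogram limits given in the text


def empirical_pdf(temps, t_min=T_MIN, t_max=T_MAX, width=BIN_WIDTH):
    """Relative frequency of temperatures in fixed-width bins (sums to 1)."""
    n_bins = int(round((t_max - t_min) / width))
    counts = [0] * n_bins
    for t in temps:
        if math.isnan(t) or not (t_min <= t <= t_max):
            continue  # assumption: missing or out-of-range values are skipped
        i = min(int((t - t_min) / width), n_bins - 1)
        counts[i] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no valid temperatures")
    return [c / total for c in counts]


def perkins_score(pdf_model, pdf_obs):
    """Eq. (1): sum over bins of the smaller of the two relative frequencies."""
    return sum(min(m, o) for m, o in zip(pdf_model, pdf_obs))


def dav(s_hr, s_lr):
    """Eq. (2): percentage of added (+) or lost (-) value of the higher resolution."""
    return 100.0 * (s_hr - s_lr) / s_lr


# Hypothetical daily maximum temperatures (°C), for illustration only.
obs = [10.1, 10.2, 12.3, 15.0]
high_res = [10.0, 10.3, 12.4, 16.0]
low_res = [10.4, 11.2, 13.0, 15.2]

pdf_obs = empirical_pdf(obs)
s_hr = perkins_score(empirical_pdf(high_res), pdf_obs)
s_lr = perkins_score(empirical_pdf(low_res), pdf_obs)
print(s_hr, s_lr, dav(s_hr, s_lr))
```

With these toy values, three of the four high-resolution samples fall in the same bins as the observations, against two of four for the low-resolution series, so the DAV is positive.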

For a fair comparison, simulations and observations must be at the same resolution when comparing two gridded datasets. Hence, following the EURO-CORDEX guidelines, the FPS-Convection models are conservatively remapped (Schulzweida et al. 2006) to the 0.1° (~ 11 km) regular grid of the observations so that degradation of the high resolution is minimized. For the same reason, the E-OBS is also conservatively remapped to the ERA-Interim 0.7° (~ 75 km) resolution. In this approach, the interpolation to lower-resolution grids may degrade the PDF, particularly in the tails (Prein and Gobiet 2017). Another option was to interpolate all simulations to the 0.1° (~ 11 km) regular grid of the observations. This allows for a comparison with all grids at the same resolution; however, unrealistic values may be generated through the interpolation (Ciarlo et al. 2020). Concerning the EMO-5 dataset, all CPRCMs are interpolated to the 5 km grid of the observations. For the lower resolution RCMs and ERA-Interim, the EMO-5 dataset is interpolated to the native grid of each model simulation. Whenever an interpolation is carried out, an adiabatic temperature correction is performed to ensure that all comparisons are computed at the same height, i.e., before the model interpolation, temperatures are adjusted to sea level with a constant lapse rate of 6.5 °C/km and afterwards, they are corrected again to the target grid orography assuming the same lapse rate. For the ECA&D station data, only the nearest neighbour grid point of each RCM/CPRCM and ERA-Interim is considered. This method has the advantage of not requiring an interpolation step, keeping the native resolution of all simulations. Still, an orographic correction considering the same 6.5 °C/km lapse rate from the model grid to the station is performed to ensure that temperature is always compared at the same level.
If station data for a certain day are missing, the corresponding values for that day are removed from all the models. Hence the same number of valid points and timesteps is used when building the PDFs for the added value computation. This approach pools together all the information within the selected domain. First, the entire ALP domain is considered when computing the DAV, thus returning a representative value for the region. However, this approach does not allow for an analysis of spatial variability. Hence, a second view is proposed to tackle this issue. A PDF is built for each 0.7° (~ 75 km) grid box instead (the original ERA-Interim resolution) by pooling together only the information within each box, regardless of the resolution of the simulations. Subsequently, the DAVs are independently computed for each grid box, thus obtaining a spatial overview of the added value. All the approaches are followed at annual and seasonal scales.
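The height correction and the grid-box pooling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the grid boxes are simply aligned to multiples of 0.7°, which need not coincide with the actual ERA-Interim grid cells.

```python
import math
from collections import defaultdict

LAPSE_RATE = 6.5e-3  # °C per metre, i.e. the constant 6.5 °C/km used in the text
BOX_SIZE = 0.7       # degrees, nominal ERA-Interim grid spacing


def height_correct(temp_c, source_height_m, target_height_m):
    """Move a temperature from the source height to the target height."""
    sea_level = temp_c + LAPSE_RATE * source_height_m  # reduce to sea level
    return sea_level - LAPSE_RATE * target_height_m    # adjust to target orography


def box_index(lat, lon, size=BOX_SIZE):
    """Assumed indexing: boxes aligned to multiples of `size` degrees."""
    return (math.floor(lat / size), math.floor(lon / size))


def pool_by_box(points):
    """Group (lat, lon, temperature) samples by coarse grid box."""
    pooled = defaultdict(list)
    for lat, lon, temp in points:
        pooled[box_index(lat, lon)].append(temp)
    return dict(pooled)


# Hypothetical example: a model grid point at 1000 m compared with a station at 1500 m.
print(height_correct(10.0, 1000.0, 1500.0))

# Hypothetical samples: the first two fall in the same 0.7° box, the third does not.
samples = [(45.1, 7.2, 1.0), (45.3, 7.5, 2.0), (46.0, 7.2, 3.0)]
print(pool_by_box(samples))
```

A PDF and a DAV would then be computed separately from the pooled values of each box, which is what yields one value per 0.7° cell in the spatial view.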

3 Results

3.1 Maximum temperatures full PDFs

This section presents the results obtained by applying the DAV metric to the daily maximum and minimum temperatures output for the FPS-Convection models, by considering either the E-OBS regular gridded dataset, the EMO-5 high-resolution gridded dataset or the ECA&D station dataset as ground truth. For all datasets involved, the full PDFs shown in the supplementary material Figures S1 and S2 are considered to build the DAV shown in Figs. 2 and 4.

Fig. 2

Maximum temperature a Yearly and seasonal ERA-Interim Perkins skill score obtained with the E-OBS, EMO-5 and ECA&D stations as reference. The following panels show boxplots of the distribution added value for b year, c DJF or winter, d MAM or spring, e JJA or summer and f SON or autumn. In each boxplot, the low whisker denotes the minimum DAV, while the high whisker represents the maximum. The horizontal lines are, from bottom to top, the 25th, 50th and 75th percentiles of the data. From left to right within each section, each set of three boxplots represents the results against E-OBS (red), EMO-5 (blue) and ECA&D stations (green). Within each boxplot panel, RCMvERA measures the DAV of the RCM simulations relative to the ERA-Interim reanalysis, CPRCMvERA measures the DAV of the CPRCM simulations relative to the ERA-Interim reanalysis, and CPRCMvRCM measures the DAV of the CPRCM relative to the RCM simulations

Figure 2a displays the scores attained by ERA-Interim against all three datasets. The higher the score, the more difficult it is for the downscaled products to reveal added value, as the limit is a score of 1. Thus, small gains, even where ERA-Interim shows scores close to unity, are relevant. The ERA-Interim results for all seasons from the three observational datasets are close, apart from summer and autumn. Under these circumstances, similar DAV values among all three observational datasets are expected. Moreover, with the higher score values for the year and SON, near-neutral or negative DAV from the downscaled products can be anticipated. Figure 2b displays the DAV for the annual scale. Indeed, given the score obtained by the driving simulation, the RCMs and CPRCMs reveal a neutral or even a small detrimental effect, with the values bounded between − 5 and 5%. The results do not deviate significantly among datasets. Only the DAV from ECA&D for the RCMs reveals a slightly different behaviour with negative values, with the minimum at approximately − 7%. For the CPRCMs, the DAV reveals a lower variability across all simulations with a median close to 0%. As a result, when comparing the RCMs with the CPRCMs, small gains occur with E-OBS and EMO-5 as a reference, but with ECA&D as the baseline, the higher resolution simulations reveal positive DAV, with 75% of the models above 0%. For most cases, the CPRCMs partly correct the negative values from the RCMs by having more neutral values, as revealed by the positive DAV in CPRCMvRCM. In general, the use of kilometre-scale grids can add value to the representation of maximum temperature when compared to the RCM resolutions. These gains are more relevant in simulations which displayed the lowest values at the RCM resolution. The seasonal DAV (Fig. 2c–f) follows this trend, but with different DAV ranges.
Usually, the domain-wide values for the RCMs are slightly lower in comparison to the CPRCMs. For winter and spring, the high RCM variability with ECA&D as a reference stands out and is greatly reduced for the CPRCMs. For summer and with E-OBS and EMO-5 as references, most RCMs and CPRCMs reveal high DAV values, with the maximum above 10% in some cases. For this season, the CPRCMs reveal a slightly detrimental effect relative to their lower resolution RCM counterparts, as seen from the median below 0%. These higher values are related mostly to the poorer performance of ERA-Interim in comparison with the other seasons. With ECA&D as a reference, the score for ERA-Interim is higher in comparison to E-OBS or EMO-5. Nevertheless, the value for summer is the minimum obtained among all seasons and datasets. For SON, the DAV pattern attained by the products downscaled from ERA-Interim is similar to summer, albeit with a smaller variability and values closer to 0%. In some sense, the near absence of added value obtained in almost all situations is not surprising since ERA-Interim assimilates the observational weather data and RCMs and CPRCMs do not. Figure S3 shows these results for the individual models.

The similarity of the results obtained between E-OBS and EMO-5 is relevant and shows that even if a higher resolution dataset is considered, the DAV values would not change significantly. As for the ECA&D station network, the results reveal an increased variability across all models. This increase might be related to the scarcity of stations outside of Germany, or to the fact that no interpolation is required in this case, as only the nearest neighbour grid point to each station is considered. An advantage of not using an interpolation step is that no additional uncertainty is introduced into the final result. In this case, the ability of the higher resolutions to converge to local values is assessed, relying on local temperatures from stations instead of areal averages. Nonetheless, the sparsity of the station distribution implies that not all regions of the domain are equally evaluated. This is particularly relevant for the Mediterranean areas.

The gains or losses of the high versus low-resolution simulations are due to small deviations of the model PDF from the observational PDF. Figure S1 (left side) shows all the PDFs considered in the DAV from Fig. 2a. Although the overall simulation PDFs follow the behaviour of the observations, some differences still occur. These can be better assessed in Figures S4-S6, which show the ratio between the models and ERA-Interim for the mean bias, standard deviation and root mean square error (RMSE). Thus, a value of 1 implies no differences between the model and ERA-Interim statistics and potentially results in a neutral DAV. For the mean bias and RMSE, values below 1 imply a better performance of the models and a positive DAV can be expected, while the opposite occurs for values above 1. For the standard deviation, a value of 1 implies no changes from ERA-Interim to the models. Although the very high-resolution CPRCMs inherit the biases from the high-resolution RCMs, the former are still able to improve, getting closer to and sometimes surpassing the performance of the driving ERA-Interim.
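The ratio diagnostics described above can be illustrated with a short Python sketch. The function names and the sample series are hypothetical, and taking the absolute value of the mean bias before forming the ratio is an assumption made here so that values below 1 consistently indicate a smaller error than ERA-Interim.

```python
import math


def mean_bias(sim, obs):
    """Mean of the simulation-minus-observation differences."""
    return sum(s - o for s, o in zip(sim, obs)) / len(obs)


def rmse(sim, obs):
    """Root mean square error of the simulation against the observations."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs))


def ratio_to_era(stat_model, stat_era):
    """Ratio of a model statistic to the same statistic for ERA-Interim."""
    return stat_model / stat_era


# Hypothetical temperatures (°C), for illustration only.
obs = [10.0, 12.0, 14.0]
model = [11.0, 13.0, 15.0]
era = [12.0, 14.0, 16.0]

rmse_ratio = ratio_to_era(rmse(model, obs), rmse(era, obs))
bias_ratio = ratio_to_era(abs(mean_bias(model, obs)), abs(mean_bias(era, obs)))
print(rmse_ratio, bias_ratio)  # both below 1: the model error is smaller than ERA-Interim's
```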

A spatial representation of the added value for the maximum temperature full PDF is shown in Fig. 3, with the regular gridded E-OBS observation dataset as a reference. These figures are built by considering each pixel of the native 0.7° resolution from ERA-Interim as a subdomain and the DAV is computed by pooling together all the information within it. Thus, the sampling of each point is independent of the others, even if the underlying variable is not spatially independent. Each panel in Fig. 3 displays the 25th, 50th, and 75th percentiles of the DAV results obtained by each model. Furthermore, the DAV values in Fig. 2 do not represent a spatial mean of Fig. 3, since the DAVs are computed by pooling together all data within the evaluation domain (the ALP domain in Fig. 2 and the ERA-Interim grid box in Fig. 3). Nevertheless, the spatial DAV in Fig. 3 should follow the values obtained from Fig. 2. At the spatial level, most models reveal neutral DAV values in the interior and some gains near the coastal sites, as given by the median. This result is not surprising given the small, near 0% percentages at the domain level (Fig. 2). At the seasonal scale the percentages are similar, but with gains on the Mediterranean coast above + 30%. Yet, for winter and spring, between 25 and 50% of models reveal some loss of value in the interior, namely over the Alps. The Alps are an area of very complex terrain, where the models struggle to adequately capture the local maximum temperature PDFs and reveal a generalized decrease of value. This is a known problem associated with snow cover and snow melt (Vautard et al. 2013; Frei et al. 2018; Terzago et al. 2017; Varga and Breuer 2020). Summer follows the same pattern, but with more evidence of added value near the coastal sites and with the top 25% of models also revealing added value in the interior. Those higher values also occurred in Fig. 2 for the domain assessment. Previous studies such as Di Luca et al.
(2013), Vautard et al. (2013), Cardoso and Soares (2022) and Careto et al. (2022b) also show added value at locations near the coast, namely for the Mediterranean region. Although the CPRCMs and RCMs use prescribed SSTs from ERA-Interim, the higher resolution of these downscaled simulations implies that the models are better able to describe the coastal circulations, breezes and land-sea boundary with the associated differential warming, and thus outperform ERA-Interim. In the interior, the land surface models have a relevant impact on the temperature distribution, which, combined with the assimilation of temperature in ERA-Interim, makes it difficult for the models to reveal added value.

Fig. 3

Yearly and seasonal distribution added value (DAV) of the daily maximum air temperature for the CORDEX FPS-Convection simulations. The P25, P50 and P75 denote the 25th, 50th, and 75th percentiles of the DAV results obtained by all simulations. a RCMvERA measures the DAV between the RCM simulations against the ERA-Interim reanalysis, b CPRCMvERA measures the DAV between the CPRCM simulations against the ERA-Interim reanalysis and c CPRCMvRCM measures the DAV between the CPRCM and the RCM simulations. All simulations are assessed with the E-OBS as a reference

For the CPRCMs with the E-OBS as reference (Fig. 3b), the overall pattern is similar to Fig. 3a, with a clear added value in locations near the Mediterranean coast and more limited gains or even losses in the interior. The added value over the coast is more evident for the seasonal values. However, in this case, the DAV percentages for the entire domain are higher in comparison to Fig. 3a, particularly for the summer. Thus, most locations emerge with a high percentage, above 5%. These higher gains for the kilometre-scale simulations are highlighted in Fig. 3c which compares the CPRCMs and RCMs. At least 25% of the models reveal gains in favour of the higher resolution at the seasonal level. On the contrary, for summer, half the models display losses near the Po valley with at least 25% revealing widespread losses over Italy and the Mediterranean.
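Statements such as "at least 25% of the models reveal gains" follow from the percentile panels: a positive P75 at a grid box means at least a quarter of the ensemble has a positive DAV there. A minimal sketch of how such panels could be derived from a stack of per-model DAV maps is given below. The array layout and function names are assumptions made for illustration.

```python
import numpy as np

def ensemble_percentiles(dav_maps, q=(25, 50, 75)):
    """Per-grid-box percentiles across models.

    dav_maps: array of shape (n_models, ny, nx) with DAV in percent;
    NaN marks boxes a model does not cover.
    """
    stacked = np.asarray(dav_maps, dtype=float)
    pcts = np.nanpercentile(stacked, q, axis=0)  # shape (len(q), ny, nx)
    return {f"P{qi}": pcts[i] for i, qi in enumerate(q)}

def fraction_with_gain(dav_maps):
    """Fraction of models with positive DAV at each grid box."""
    stacked = np.asarray(dav_maps, dtype=float)
    valid = ~np.isnan(stacked)
    gains = np.where(valid, stacked > 0, False).sum(axis=0)
    return gains / np.maximum(valid.sum(axis=0), 1)

# Four hypothetical models on a 2 x 2 grid with uniform DAV values.
maps = np.stack([np.full((2, 2), v) for v in (-10.0, 0.0, 10.0, 20.0)])
panels = ensemble_percentiles(maps)
print(panels["P50"][0, 0], fraction_with_gain(maps)[0, 0])
```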

The supplementary Figures S7 to S9 display the DAV results with the E-OBS, EMO-5 and ECA&D datasets as references, respectively. The results among the three datasets are very similar, even for the ECA&D stations, particularly in Germany, where the station density is higher. For all observational datasets, the Mediterranean coast displays added value, contrasting with the near-neutral or even negative values in the interior. Furthermore, this spatial assessment also allows the gains to be evaluated at different height levels. Since the downscaled products are evaluated at a higher resolution than ERA-Interim, each coarse grid cell is associated with a certain height range. Overall, for maximum temperature the FPS-Convection simulations reveal gains near sea level, contrasting with some difficulties over the complex terrain of the Alps and near-neutral values over the rest of the domain.

3.2 Minimum temperatures full PDFs

Figure 4 shows the results for the minimum temperature across the entire domain. The higher ERA-Interim scores (Fig. 4a), in comparison to the maximum temperature, imply an added difficulty for models to reveal added value. Indeed, the overall DAV values for minimum temperature are lower than those from Fig. 2. At the annual scale, more than 75% of the RCMs evaluated against the E-OBS and EMO-5 reveal losses, while all models show negative DAV with the ECA&D as a reference. The performance in terms of DAV for the CPRCMs is similar, with a slight improvement for simulations evaluated against the ECA&D stations. The seasonal DAV follows a similar pattern, although with increased variability across all simulations. For winter and with the E-OBS as a reference, 75% of the simulations reveal a small detrimental effect, which is partially corrected by the CPRCMs, with 75% of the models showing improvements towards the higher resolutions. With the EMO-5 as a baseline, the results are similar, although more than 50% of the RCMs reveal added value. For the ECA&D, all FPS-Convection simulations display losses. However, in this case, the improvement for the CPRCMs is higher in comparison with the other datasets, reaching +15%. For spring, the difference among datasets is similar, but with lower DAV for E-OBS and EMO-5. As for summer, the improvement from the CPRCMs in comparison to the RCMs is non-existent for the E-OBS, while for EMO-5 and ECA&D approximately half the models lose value at the higher kilometre-scale resolutions. For autumn, the pattern of DAV values is similar, albeit with a smaller variability among all the models. The individual model results are displayed in Figure S10.

Fig. 4

Minimum temperature. a Yearly and seasonal ERA-Interim Perkins skill score obtained with the E-OBS, EMO-5 and ECA&D stations as reference. The following panels show boxplots of the distribution added value for b year, c DJF or winter, d MAM or spring, e JJA or summer and f SON or autumn. In each boxplot, the low whisker denotes the minimum DAV, while the high whisker represents the maximum. The horizontal lines are, from bottom to top, the 25th, 50th and 75th percentiles of the data. From left to right within each section, each set of three boxplots represents the results against E-OBS (red), EMO-5 (blue) and ECA&D stations (green). Within each boxplot panel, RCMvERA measures the DAV between the RCM simulations against the ERA-Interim reanalysis. CPRCMvERA measures the DAV between the CPRCM simulations against the ERA-Interim reanalysis. CPRCMvRCM measures the DAV between the CPRCM and the RCM simulations

This general decrease of value from the FPS-Convection models is mostly due to two complementary factors. First, ERA-Interim incorporates 2-m temperature observations from weather stations and also measurements of upper-air variables from satellite data, radiosondes, pilot balloons, and aircraft, among others (Prömmel et al. 2010; Dee et al. 2011). Whilst the maximum and minimum temperatures are derived from forecasts, these variables should still converge to the observed data, which by itself makes it difficult for the downscaling products to reveal added value, hence the high scores (Figs. 2a and 4a). The second factor is related to the overestimation of the temperature bins around 0 °C (Figure S2), particularly evident for the annual and winter PDFs. This poorer performance for minimum temperature can be associated with the simulation of snow accretion, temperature inversions due to the discretization of steep slopes, and cold drainage in the valleys. An even finer resolution is required for the correct simulation of the latter phenomena. Overall, the higher-resolution RCMs can better depict the mountain ridges and valleys, while at the same time they reveal an inability to adequately represent snow cover, depth and melt. These issues impact the snow-albedo temperature feedback through the balance between latent and sensible heat fluxes, leading to a misrepresentation of temperature around 0 °C (García-Díez et al. 2015; Minder et al. 2016; Terzago et al. 2017). This causes an underestimation of snow depth at lower altitudes and an overestimation at higher elevations. Given the topographic complexity of the Alpine region, these issues play an important role, especially at higher resolutions and altitudes.

Figures S11 to S13 display the mean bias, standard deviation and RMSE, respectively, from all simulations against all considered observational datasets, complementing the PDFs shown in Figure S2. Indeed, the ERA-Interim reanalysis reveals a remarkably good performance, with smaller biases and a variability close to that of the E-OBS dataset. Thus, the downscaling products are expected to have added difficulty in outperforming the driving reanalysis. In comparison to maximum temperature (Figure S3), the models reveal a systematically poorer performance than ERA-Interim, hence the overall loss of value. Nevertheless, the range of RMSE differences is similar to that of the maximum temperature, particularly for E-OBS and EMO-5. However, the standard deviation reveals a distinct perspective. Indeed, for minimum temperature (Figure S12) the variability of the FPS-Convection models relative to observations is much higher. The lower DAV percentages for minimum temperature in comparison to the maximum temperature were also found in Careto et al. (2022b) for the EURO-CORDEX hindcast simulations over the Iberian Peninsula. Furthermore, when analysing the added value for the greater Alpine region, Prömmel et al. (2010) obtained comparable results.

Figure 5 displays the spatial distribution of the DAV results for minimum temperature. From the negative DAVs across the seasons and at the yearly timescale (Fig. 4), one could expect losses in the entire domain in Fig. 5a. However, as with maximum temperature, some points located primarily over Italy and particularly over the Mediterranean coast show improvements for all seasons except summer. Once again, the better representation of the land-sea contrast implies added value. Conversely, it is hard for the RCMs to add value over flat terrain far from the coastlines, especially against ERA-Interim, which assimilates temperature observations, or over complex terrain, due to the snow-albedo-temperature feedback and other uncertainties from land surface models. For all seasons apart from summer, more than 50% of the models reveal neutral values in the interior and gains on the Mediterranean coast, namely for winter over Italy. For summer, despite the high gains near the coast, with values above +30%, at least 25% of the models reveal relevant losses of the order of −20% or below across the entire domain.

Fig. 5

Yearly and seasonal distribution added value (DAV) of the daily minimum air temperature for the CORDEX FPS-Convection simulations. The P25, P50 and P75 denote the 25th, 50th, and 75th percentiles of the DAV results obtained by all simulations. a RCMvERA measures the DAV between the RCM simulations against the ERA-Interim reanalysis, b CPRCMvERA measures the DAV between the CPRCM simulations against the ERA-Interim reanalysis and c CPRCMvRCM measures the DAV between the CPRCM and the RCM simulations. All simulations are assessed with the E-OBS as a reference

The CPRCMs shown in Fig. 5b reveal a comparable behaviour relative to the RCMs. Still, some differences exist, with 25% of the models displaying higher gains near the shore. Moreover, for winter, 25% of the models also reveal gains over the Alps. Indeed, the benefit of the CPRCMs is found precisely over regions of complex topography and near the shore. On the contrary, in summer, at least 25% of the kilometre-scale models have poorer performance in comparison to their respective RCM counterparts. Those lower DAVs occur over central Italy. Figures S14 to S16 show the results for the individual models against all three observational datasets. Nevertheless, the values obtained with the ECA&D should be viewed with care, as the low station coverage outside of Germany can add uncertainty to the analysis. Still, as with maximum temperature, the values are similar with the E-OBS, EMO-5 and ECA&D as references.

3.3 Maximum and minimum temperature tails (or extremes)

Figures 6 and 7 display the added values regarding the extreme tails of maximum and minimum temperature, respectively. In this case, the PDFs are built considering only the maximum (minimum) temperatures above (below) the 90th (10th) observational percentiles given in Figure S1 (S2). The versatility of DAVs, being able to be applied to PDF sections, allows for an assessment of the added value related to a segment, which may not stand out when considering the entire PDF.
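The restriction of the DAV to a tail of the distribution can be sketched as follows. The threshold is taken from the observations, as stated above. Two choices in this sketch are assumptions rather than details given in the text: the tail histograms are normalised by the full sample size, so that a dataset with too few or too many exceedances is penalised, and the bin width is 1 °C. All names are illustrative.

```python
import numpy as np

def _clean(x):
    x = np.asarray(x, dtype=float).ravel()
    return x[~np.isnan(x)]

def tail_score(sim, obs, threshold, upper=True, bin_width=1.0):
    """Overlap of the sim and obs PDFs restricted to one tail."""
    sim, obs = _clean(sim), _clean(obs)
    lo = np.floor(min(sim.min(), obs.min()))
    hi = np.ceil(max(sim.max(), obs.max()))
    edges = np.arange(lo, hi + 2 * bin_width, bin_width)
    if upper:
        sim_tail, obs_tail = sim[sim > threshold], obs[obs > threshold]
    else:
        sim_tail, obs_tail = sim[sim < threshold], obs[obs < threshold]
    h_sim, _ = np.histogram(sim_tail, bins=edges)
    h_obs, _ = np.histogram(obs_tail, bins=edges)
    # Assumption: normalise by the full sample sizes, not the tail sizes.
    return np.minimum(h_sim / sim.size, h_obs / obs.size).sum()

def tail_dav(high_res, low_res, obs, q=90, upper=True, bin_width=1.0):
    """DAV (%) for values beyond the q-th observational percentile."""
    threshold = np.percentile(_clean(obs), q)
    s_hr = tail_score(high_res, obs, threshold, upper, bin_width)
    s_lr = tail_score(low_res, obs, threshold, upper, bin_width)
    return 100.0 * (s_hr - s_lr) / s_lr

# Hot extremes: q=90, upper tail. Cold extremes: q=10, lower tail.
rng = np.random.default_rng(0)
obs = rng.normal(25.0, 5.0, 5000)
print(tail_dav(obs + 0.5, obs - 3.0, obs, q=90, upper=True))
```

With this normalisation the score of a perfect dataset equals the tail probability (about 0.1), but the DAV is a ratio of scores, so the constant cancels.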

Fig. 6

Maximum temperature extremes, only considering the temperatures above the 90th percentile of maximum temperature observations. a Yearly and seasonal ERA-Interim Perkins skill score obtained with the E-OBS, EMO-5 and ECA&D stations as reference. The following panels show boxplots of the distribution added value for b year, c DJF or winter, d MAM or spring, e JJA or summer and f SON or autumn. In each boxplot, the low whisker denotes the minimum DAV, while the high whisker represents the maximum. The horizontal lines are, from bottom to top, the 25th, 50th and 75th percentiles of the data. From left to right within each section, each set of three boxplots represents the results against E-OBS (red), EMO-5 (blue) and ECA&D stations (green). Within each boxplot panel, RCMvERA measures the DAV between the RCM simulations against the ERA-Interim reanalysis. CPRCMvERA measures the DAV between the CPRCM simulations against the ERA-Interim reanalysis. CPRCMvRCM measures the DAV between the CPRCM and the RCM simulations

Fig. 7

Minimum temperature extremes, only considering the temperatures below the 10th percentile of minimum temperature observations. a Yearly and seasonal ERA-Interim Perkins skill score obtained with the E-OBS, EMO-5 and ECA&D stations as reference. The following panels show boxplots of the distribution added value for b year, c DJF or winter, d MAM or spring, e JJA or summer and f SON or autumn. In each boxplot, the low whisker denotes the minimum DAV, while the high whisker represents the maximum. The horizontal lines are, from bottom to top, the 25th, 50th and 75th percentiles of the data. From left to right within each section, each set of three boxplots represents the results against E-OBS (red), EMO-5 (blue) and ECA&D stations (green). Within each boxplot panel, RCMvERA measures the DAV between the RCM simulations against the ERA-Interim reanalysis. CPRCMvERA measures the DAV between the CPRCM simulations against the ERA-Interim reanalysis. CPRCMvRCM measures the DAV between the CPRCM and the RCM simulations

The added value for the maximum temperature extremes, shown in Fig. 6, displays a different pattern when compared to Fig. 2. First, the score attained by ERA-Interim is lower when compared with Fig. 2a, particularly for the winter season. At the annual scale (Fig. 6b), the DAVs from at least half the RCMs are negative for all observational datasets, with a significant difference between the minimum value and the 25th percentile, at −25% and −5% respectively. In this case, the CPRCMs are not able to correct those losses, and most models reveal a loss of value. For winter, the picture differs. The lower ERA-Interim score provides room for improvements from the FPS-Convection models. Indeed, all simulations evaluated against E-OBS and EMO-5 display gains, in some cases surpassing +20%, and more than half the RCMs display DAV above +15%. For the ECA&D, the results are more neutral for the RCMs, with the kilometre-scale simulations showing improved performance. For spring, the results obtained are closer to those from Fig. 2d, although with lower or even negative values. Summer displays the exact opposite of winter, with lower and negative values for all models with the E-OBS and EMO-5 as references, while the ECA&D reveals higher DAV percentages that are still below 0 for more than half the RCMs. This significant detrimental effect is in line with Vautard et al. (2013) and can also be associated with the strong dry bias that all models exhibit in summer (Ban et al. 2021). On the contrary, for autumn, most simulations add value, albeit not as clearly as in winter. Overall, for the extreme maximum temperature, apart from the results from the ECA&D stations, the kilometre-scale simulations tend to lose value relative to the respective forcing RCMs.

The DAVs for the cold extremes are shown in Fig. 7. Compared with all cases previously assessed (Figs. 2a, 4a and 6a), the ERA-Interim score is higher for all three datasets. The exception is summer with the ECA&D as a reference. Also, as expected from Figs. 7a and S2, a general loss of value occurs for all simulations at the annual scale (Fig. 7b). Still, the CPRCMs are able to add value, partly correcting the results from the RCMs. Yet this improvement is not sufficient, as almost all FPS-Convection simulations display detrimental effects. At the seasonal scale, the results for winter, spring and autumn are similar to the annual scale. For winter, the models reveal the worst performance. The 10th percentile threshold of the observations is well below 0 °C, the temperature around which the issues related to the snow-albedo feedback occur. Still, models display a loss of value, possibly owing to the good performance of ERA-Interim (Fig. 7a). On the other hand, for the intermediate seasons, the percentile threshold is close to 0 °C, particularly for autumn. In this case, the loss of value could be associated with the issues related to snow cover, melt and depth. For summer, the PDFs in Figure S2 do not reveal these issues, although negative percentages still occur. In this case, the score of ERA-Interim is higher for the E-OBS, lower for the EMO-5 and considerably lower for the ECA&D stations. These differences are relevant for the added value in the summer season, as the FPS-Convection models reveal positive gains for the ECA&D and negative DAV with the E-OBS as a reference.

The spatial structure of the hot extreme DAV is shown in Fig. 8. The evaluation for the RCMs with the E-OBS as a reference is shown in Fig. 8a. Most models exhibit grid points where the RCMs perform better than ERA-Interim, namely in winter, as expected from Fig. 6a. Moreover, the gains are focused on the Mediterranean coastal areas for all seasons, as in the evaluation considering the entire PDF (Fig. 3). While only winter displays large positive values in Fig. 6, in Fig. 8a all seasons reveal added value at most locations, for instance over regions of France in spring, summer, and autumn. Following Fig. 6a, summer is the season where models have more difficulty in obtaining added value. The losses are particularly evident over the Po valley and around the Adriatic Sea. This is a known hotspot for convective activity during the warmer seasons. Thus, since deep convection is parametrized in the RCMs, larger uncertainties are expected for precipitation, which in the end can feed back into the temperature values. The kilometre-scale simulations shown in Fig. 8b reveal a very similar pattern to the previous domain assessment. As in Fig. 6a, the winter season is by far the best performing, with gains for almost all models and regions. As for the intermediate seasons, all RCMs display similar inter-model behaviour in comparison to winter. The different performances across RCMs and CPRCMs are highlighted in Fig. 8c, which shows a slight detrimental impact when considering the highest resolution. These losses occur for more than half the models, mostly for the summer season. Nevertheless, the improvement found for the FPS-Convection models near the coast surpasses the DAV obtained for the full PDF (Fig. 3). All gains in the tail of the maximum temperature distribution are especially relevant, since they suggest a better characterization of temperature extremes and the related heat waves.
As before, Figures S19, S20 and S21 display the spatial DAV against all three respective observational datasets for the individual models.

Fig. 8

Yearly and seasonal distribution added value (DAV) of the daily maximum air temperature only considering the values above the observational 90th percentile for the CORDEX FPS-Convection simulations. The P25, P50 and P75 denote the 25th, 50th, and 75th percentiles of the DAV results obtained by all simulations. a RCMvERA measures the DAV between the RCM simulations against the ERA-Interim reanalysis, b CPRCMvERA measures the DAV between the CPRCM simulations against the ERA-Interim reanalysis and c CPRCMvRCM measures the DAV between the CPRCM and the RCM simulations. All simulations are assessed with the E-OBS as a reference

Figure 9 displays the spatial distribution of the DAV results for the cold extremes, i.e., only for values below the observational 10th percentile. While the DAVs for the whole domain are mostly negative, the situation differs when assessing the results at a more local scale. Overall, more than 50% of the RCMs reveal added value for most of the domain. In contrast, at the annual and winter scales, at least 25% of the models display a generalized loss of value, except near the Mediterranean shore. For the CPRCMs, the DAV values are similar (Fig. 9b). This similarity is confirmed in Fig. 9c, which compares the CPRCMs against the respective driving RCMs. These values contrast with those attained in Fig. 7. In fact, for all cases, the spatial DAVs, i.e., the evaluation at a more local scale, achieve higher values.

Fig. 9

Yearly and seasonal distribution added value (DAV) of the daily minimum air temperature only considering the values below the observational 10th percentile for the CORDEX FPS-Convection simulations. The P25, P50 and P75 denote the 25th, 50th, and 75th percentiles of the DAV results obtained by all simulations. a RCMvERA measures the DAV between the RCM simulations against the ERA-Interim reanalysis, b CPRCMvERA measures the DAV between the CPRCM simulations against the ERA-Interim reanalysis and c CPRCMvRCM measures the DAV between the CPRCM and the RCM simulations. All simulations are assessed with the E-OBS as a reference

4 Discussion and conclusions

Overall, the gains from running higher-resolution simulations for maximum and minimum temperatures are limited across the participating FPS-Convection simulations when measured against the ERA-Interim reanalysis. For instance, the added value over the greater Alpine domain is negative, particularly for minimum temperature, with the exception of some isolated models or seasons, similar to the results from Cardoso and Soares (2022). In this case, since the entire domain is being considered, locations where the models add value and locations where they lose value are pooled together, affecting the PDFs. The lower performance of the FPS-Convection models could be attributed in part to the interpolation and the subsequent orographic correction with a constant lapse rate. Temperature lapse rates in complex terrain deviate significantly from the climatic lapse rate on daily to seasonal timescales and are not uniform with altitude (Minder et al. 2010; Sheridan et al. 2010). The assumption of a constant lapse rate can induce errors of the order of ~1 to 2 °C (Sheridan et al. 2010). Thus, a more accurate lapse rate would improve all the orographic adjustments. A simple method using data from neighbouring grid points to estimate each grid point's daily lapse rate could be pursued in a future analysis to alleviate the problem. Nevertheless, for a fair assessment, the comparison must always be performed at the same height, either for gridded data or local stations. However, one must also consider that the kilometre-scale output is interpolated to 0.1° and is not evaluated at its native resolution. Thus, some discrepancies are expected, particularly in the tails, since the interpolation smooths the fields and relevant details are lost. This is supported by the slightly different results obtained with the ECA&D stations as a reference, namely in the extremes.
It is also worth noting that the E-OBS gridded dataset is mostly based on the ECA&D stations, which are sparsely distributed across the domain, namely on the highest peaks and in Mediterranean regions. Large uncertainty is thus associated with the E-OBS temperatures in the Alpine region and its surroundings, since the dataset is primarily conditioned by station density. For this reason, a kilometre-scale gridded dataset covering the entire Alpine domain at 5 km resolution was also considered. However, the results obtained are similar to those with the E-OBS as a reference. This similarity could imply that the interpolations have a small impact on the final DAV values.
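The orographic correction and the suggested neighbour-based alternative can be sketched as follows. The constant value of −6.5 °C per km is the standard-atmosphere lapse rate and is an assumption here, since the value used in the analysis is not restated in this section. The least-squares fit over neighbouring grid points is only one possible reading of the "simple method" proposed above, and all names are illustrative.

```python
import numpy as np

STANDARD_LAPSE = -6.5e-3  # degC per metre; assumed constant lapse rate

def adjust_to_height(temp, z_source, z_target, lapse_rate=STANDARD_LAPSE):
    """Shift a temperature from the source height to the target height."""
    return temp + lapse_rate * (z_target - z_source)

def local_lapse_rate(temps, heights):
    """Least-squares dT/dz from neighbouring grid points for one day."""
    slope, _intercept = np.polyfit(np.asarray(heights, dtype=float),
                                   np.asarray(temps, dtype=float), 1)
    return slope

# Hypothetical 3 x 3 neighbourhood around a grid point on a winter day
# with a weak lapse rate (-3 degC per km), as can occur during inversions.
heights = np.array([400., 650., 900., 1200., 1500., 1800., 2100., 2400., 2700.])
temps = 2.0 - 3.0e-3 * heights
gamma = local_lapse_rate(temps, heights)

# Moving the central value (1500 m) to a station at 500 m:
t_const = adjust_to_height(temps[4], 1500.0, 500.0)
t_local = adjust_to_height(temps[4], 1500.0, 500.0, lapse_rate=gamma)
print(f"constant: {t_const:.2f} degC, local: {t_local:.2f} degC")
```

In this synthetic case the two adjustments differ by 3.5 °C over a 1000 m height difference, which illustrates how a fixed lapse rate can introduce errors of the size discussed above when the actual daily lapse rate departs from the climatic one.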

The description of maximum temperature is considerably better than that of minimum temperature. This can be related to the misrepresentation of snow depth, melt and cover (Terzago et al. 2017; Kotlarski et al. 2014) and of the interactions between the snow and the atmosphere. Misrepresentation of these interactions affects the albedo (snow-albedo feedback) and the energy balance at the land surface, and consequently the calculation of the near-surface temperature in the models, which is especially evident in winter in mountain regions (Minder et al. 2016). Although using different versions of the models evaluated here, Terzago et al. (2017) found an overestimation of snow depth on the ridges, inducing lower temperatures which can extend beyond the winter months. In summer, the development of convective clouds is a common occurrence due to land–atmosphere feedbacks. Ban et al. (2021) found that although CPRCMs improve the representation of precipitation, almost all models have a dry bias which is larger than that of the RCMs. The lack of soil moisture contributes not only to a lower frequency of convective precipitation but also to an overestimation of the higher temperature extremes, hence the negative added value for maximum temperature above the 90th percentile.

When looking from a spatial perspective, the finer resolution of CPRCMs allows a higher spatial detail of temperature and other variables, often correcting the negative impact of intermediate resolutions and could in the end add value, even when interpolated to the 0.1° (~ 11 km) resolution of the E-OBS or the 5 km from the EMO-5 gridded datasets. Although one can see a loss in most regions and models, there is added value in e.g., southern European coastal regions. These consistent gains over the Mediterranean coasts hint at improved coastal dynamics, derived from the increase of resolution, either from a better-resolved topography near the coast or by a better representation of the ocean-land contrasts on atmospheric flow. Although the very high-resolution CPRCMs inherit the biases from the high-resolution RCMs, those are still able to improve, getting closer and sometimes surpassing the driving ERA-Interim performances. The gains for the use of kilometre-scale resolution for temperature are limited in this context, which highlights the need for an assessment of the applicability of parametrizations, designed for coarser resolutions, to kilometre scales. Nevertheless, other variables such as precipitation, usually reveal an important added value for the higher resolutions, particularly relevant for the precipitation extremes (Bauer et al. 2011; Prein et al. 2013; Warrach-Sagi et al. 2013; Torma et al. 2015; Rummukainen 2016; Soares and Cardoso 2018; Ciarlo et al. 2020; Lind et al. 2020; Schwitalla et al. 2020; Ban et al. 2021; Careto et al. 2022a). The improvement of the representation of convective cells in tandem with a better description of snow accretion, soil moisture and groundwater would enhance our understanding of the processes that would contribute to the development of better parameterisations and enhance the reliability of the simulations. 
Notwithstanding, the improvement of the spatial representation of local terrain, atmospheric circulations and land–ocean-atmosphere interactions suggests that the ambition to use increasingly higher-resolution simulations is worth pursuing.