1 Introduction

Changes in the frequency and intensity of climatic heat extremes have important impacts on sectors such as agriculture, energy demand, transportation and health. For this reason a realistic assessment of future climate is fundamental in order to understand the challenges ahead. The climate models used to produce future projections are also used to simulate the climate of the recent past. This is done by taking verified historical observations as boundary conditions (e.g. sea surface temperature, land use) and using observed values as internal (artificial and natural, such as greenhouse gas and aerosol concentrations, volcanic eruptions) and external forcings (e.g. solar irradiance). The historical climate simulations are crucial for the validation of projections of the future, as they can be compared with observations (Flato et al. 2014). The Coupled Model Intercomparison Project 3 (CMIP3) (Meehl et al. 2007) and CMIP5 (Taylor et al. 2012) have contributed to collecting and comparing all the available climate simulations and projections, with a continuous improvement in the use of historical forcings and in the temporal coverage (Taylor et al. 2012; Flato et al. 2014). The CMIP5 experiments showed a clear improvement in model performance on temperature simulations compared to CMIP3 (Flato et al. 2014). More recently the PRIMAVERA project (a European Union Horizon 2020 project) has worked within the CMIP6 HighResMIP protocol, coordinating a set of experiments designed to assess both standard and enhanced horizontal-resolution simulations in the atmosphere and ocean (up to 0.25° in the atmosphere) (Haarsma et al. 2016).

The comparison of these simulations against observations is a powerful tool for assessing how the models reproduce the climate under observed conditions and forcings. The evaluations performed in recent years have focused on the intercomparison of the models or on the comparison with reanalyses, individual observed series or, more recently, gridded observational datasets (Flato et al. 2014). The choice of reference is fundamental and needs to be made with care (Sillmann et al. 2013). The use of observations is preferable (Gleckler et al. 2008; Flato et al. 2014). While the use of individual station data is a very direct approach, it is hampered by the fact that the gridbox value of a model represents an area average whereas the station observation is a point value. This is particularly problematic for climatic extremes.

This issue is avoided by using gridded observations which remove observational noise and provide a homogeneous spatial distribution (Cornes and Jones 2013). Nevertheless, the gridded datasets, especially on a daily resolution, have been deficient for a long time (Kiktev et al. 2003; Kharin et al. 2005) and the possible comparisons are still limited in space and time (Flato et al. 2014). In addition, the underlying station data may have non-climatic signals due to changes in location, management or observation techniques which can cloud a trend-based evaluation of models. The E-OBS gridded dataset (Haylock et al. 2008; Cornes et al. 2018), used in this study, matches the high spatial and temporal detail of this new generation of models and is based on a set of homogenized station data, see Sect. 2.2 for more details.

The model evaluations performed by Kharin et al. (2005) and, more recently, Sillmann et al. (2013) have shown that climate simulations reproduce time averaged values, like monthly means, better than the extreme ones. However the greatest impacts on economy and society are related to extreme events rather than averages. Furthermore, while it is generally accepted that climate changes are driven by average values (Scherrer et al. 2005), debate has taken place about the magnitude of the changes in variability (Alexander et al. 2006; Simolo et al. 2010; Morak et al. 2011; Donat and Alexander 2012), which are shown to lead to amplified effects on extreme events (Katz and Brown 1992; Schär et al. 2004) and on the indices used to describe them (Della-Marta et al. 2007). A simple shift in the mean insufficiently explains particular record-breaking events, such as the heatwave of 2003 (Schär et al. 2004).

Early model evaluations (Kharin et al. 2005, 2007; Sterl et al. 2008) have focused on hard extremes, whose trends lack significance, since they have a return time of the order of years (Frich et al. 2002). Such indices present a high interannual variability, which makes it difficult to calculate trends on relatively short periods. Furthermore, these indices are very sensitive to quality issues in the observational data, especially if the quality issues occur at a station in a data sparse area. Indices calculated using percentile-based thresholds are good alternatives to focus on climatic extremes. Two examples are TN10p, percentage of days with daily minimum temperature below the 10th percentile and TX90p, percentage of days with daily maximum temperature above the 90th percentile (ETCCDI 2009). In both cases the thresholds are calculated considering the data of the series in a long reference period, usually thirty years. This makes them site specific (the thresholds change for each grid point), and not affected by biases and applicable to any climate (Klein Tank and Können 2003; Kiktev et al. 2003; Sillmann et al. 2014b).
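The definition of the percentile-based indices can be illustrated with a minimal sketch. It assumes a single threshold per season and grid point, whereas the ETCCDI definition uses calendar-day thresholds; the function names `percentile` and `exceedance_percentage` are hypothetical and are not taken from the scripts used in this study.

```python
def percentile(values, q):
    """Percentile (q in 0..100) of a sample, with linear interpolation."""
    xs = sorted(values)
    if not xs:
        raise ValueError("empty sample")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac


def exceedance_percentage(series, threshold, above=True):
    """Percentage of days strictly above (or strictly below) the threshold."""
    if above:
        hits = sum(1 for v in series if v > threshold)
    else:
        hits = sum(1 for v in series if v < threshold)
    return 100.0 * hits / len(series)


# Illustrative values only: a reference period and one season of daily data.
reference_tx = list(range(1, 101))
tx90_threshold = percentile(reference_tx, 90)
tx90p = exceedance_percentage([80, 95, 100, 50], tx90_threshold, above=True)

tn10_threshold = percentile(reference_tx, 10)
tn10p = exceedance_percentage([5, 10, 20, 30], tn10_threshold, above=False)
```

Because the threshold is derived from the series itself, a constant offset added to both the reference period and the analysed season leaves the index unchanged, which is why these indices are insensitive to mean biases.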

The Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC) indicated progress for CMIP5 over CMIP3 in the bias of the models (Flato et al. 2014). The multi-model average of CMIP5 presented a general agreement with observations, with the exception of a cold bias in winter months over Northern Europe (Flato et al. 2014). Nevertheless, relevant shortcomings have been found when analyzing the trends of average and extreme values. Bhend and Whetton (2013) found a significant underestimation of the trends in summer average temperatures over Southern Europe and the Mediterranean. At the same time, Min et al. (2013) assessed the simulation of TXx (warmest maximum temperature in a year) over North Western Europe performed by regional models. This last work stressed that important aspects of the occurrence of moderate maximum temperature extremes over North Western Europe are missed by a set of regional models. The poor simulation of the summer extremes in the present climate raises questions about the reliability of the same models in their projections of extreme events in the coming decades (Bhend and Whetton 2013). The new homogenized gridded dataset over Europe (referred to as E-OBSv19.0HOM) provides a spatial and temporal coverage that allows an evaluation of the global high resolution models that have been made available in the PRIMAVERA project. In contrast to the previous studies on the same topic, the focus will be on the trends of indices of extreme events such as TN10p and TX90p, with particular attention to seasonal values over some European subregions. This work provides for the first time a complete Europe-wide assessment of the performance of the considered models in simulating temperature trends, including an assessment of the improvement of the high resolution versions in comparison to their lower resolution counterparts. Such an approach follows other works that have evaluated the progress of HighResMIP in the analysis of other phenomena, such as tropical cyclones (Roberts et al. 2020). It is important to stress that this study does not aim at identifying the reasons behind the observed discrepancies. This will be the subject of future projects that will use the results of this work as a starting point for the diagnosis of the shortcomings in the models.

2 Data and methods

2.1 Models

The tested simulations have been developed in the framework of the PRIMAVERA project, which aims at increasing the spatial resolution of climate models. Six models have been analyzed in their high resolution (HR) version (see Table 1) and in a previously existing lower resolution (LR) version (see Table 2), focusing on the period 1970–2014 (see Sect. 2.2) and on the region enclosed between 22° W and 50° E and between 20° N and 76° N. The variables considered in this study are minimum temperature (TN) and maximum temperature (TX) at a daily timescale. Each model taking part in PRIMAVERA has contributed several experiments; the one used for this work is named "highres-SST-present". This consists of a simulation of the atmospheric conditions over the period 1950–2014, taking observed sea surface temperature, sea-ice concentration and incoming radiation as forcings. Each model has a different spatial resolution and a different number of ensemble members. Tables 1 (HR) and 2 (LR) summarize the characteristics of the models used and the availability of the ensemble members as of 23 September 2019.

The ECMWF model has a native resolution of Tco399 (∼ 25 km) for HR and Tco199 (∼ 50 km) for LR. Within PRIMAVERA its output has been provided in a regridded version, on regular latitude-longitude grids of 0.25° and 0.5° respectively (see Roberts et al. 2018 for more details). The EC-Earth3P model runs at resolution TL511 for HR and TL255 for LR on a non-regular latitude-longitude grid. The scripts used for the calculation of the indices (Sect. 2.3) require regular grids, making it necessary to regrid the EC-Earth3P output.

Table 1 Characteristics of the high resolution models that have been used
Table 2 Characteristics of the low resolution models that have been used

2.2 E-OBS

The reference used for the evaluation of the models is E-OBS for TN and TX (Haylock et al. 2008; Cornes et al. 2018). It comes as a 100-member ensemble, whose spread increases in areas with low station density, indicating a larger uncertainty. In this work only the ensemble mean is considered. E-OBS is based on the station data of the European Climate Assessment & Dataset (ECA&D) (Klein Tank et al. 2002), which collects data of thirteen variables from more than 19,000 stations located in all countries of the European and Mediterranean region. Almost 10,000 of these stations include temperature data. These are provided by National Meteorological and Hydrological Services, universities or private companies and range from the late 18th century to the present. However, relocation of stations, instrumentation changes and variations in the surroundings of the meteorological stations affect the quality of the ECA&D temperature series of the stations concerned (and therefore of E-OBS), reducing their reliability for temporal analyses. For this analysis, a modified version of E-OBS is constructed based on recent work on the homogenization of the temperature series of ECA&D (Squintu et al. 2019, 2020). These studies describe how a large part of the inhomogeneities have been removed, making it possible to smoothly combine series that belong to neighbouring stations and to gather the data into one long-running homogeneous series, called a blended series. This process considerably improves the input data for E-OBS, which becomes a dataset of long and homogenized series: a prerequisite for a thorough climatic change assessment (ETCCDI 2009; Jones and Wigley 2010).

For the purpose of this work, only the blended series that start before 1970 and stop after 2014 have been considered in the construction of a special version of E-OBS, called E-OBS.hom. This selection aims at having a constant number of blended series contributing to each grid-point, avoiding changes in station density that might introduce inhomogeneities. Table 3 shows that the number of blended series does not change drastically whether 1970 or 1980 is chosen as starting point; 1970 has therefore been chosen in order to work with a longer period.

Table 3 Number of series which are continuously active during the indicated time interval

2.3 Data analysis

In this work minimum and maximum temperatures have been analyzed, focusing on the seasons: winter (December, January and February; DJF), spring (March, April and May; MAM), summer (June, July and August; JJA) and autumn (September, October and November; SON). Furthermore, all the results have been summarized by taking means over the whole domain and over six relevant regions: Iberian Peninsula, Southern, Eastern, Western, Central and Northern Europe. The boundaries of these regions can be seen, for example, in Fig. 1. Even though climate phenomena obviously extend across political or statistically determined boundaries, these areas have been identified as those sharing common characteristics for a large number of the analyzed parameters.
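A regional mean of a gridded field can be sketched as follows. The text does not state whether the means are area-weighted; the sketch assumes cosine-of-latitude weighting on a regular grid and a rectangular region, and both the function name `regional_mean` and the example box are hypothetical.

```python
from math import cos, radians


def regional_mean(lats, lons, field, box):
    """Area-weighted mean of field[j][i] (at lats[j], lons[i]) inside a box.

    box = (lat_min, lat_max, lon_min, lon_max); None marks missing cells
    (e.g. sea points, where E-OBS has no data).
    """
    lat_min, lat_max, lon_min, lon_max = box
    total = 0.0
    weight_sum = 0.0
    for j, lat in enumerate(lats):
        if not lat_min <= lat <= lat_max:
            continue
        weight = cos(radians(lat))  # assumed weighting, see lead-in
        for i, lon in enumerate(lons):
            value = field[j][i]
            if lon_min <= lon <= lon_max and value is not None:
                total += weight * value
                weight_sum += weight
    if weight_sum == 0.0:
        raise ValueError("no valid grid points inside the box")
    return total / weight_sum


# Illustrative values only.
example_mean = regional_mean(
    lats=[0.0, 60.0],
    lons=[0.0, 1.0],
    field=[[1.0, 1.0], [4.0, None]],
    box=(-10.0, 70.0, -10.0, 10.0),
)
```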

While all the values are available in the tables of Appendices A and B (Tables 4, 5, 6), the figures and the discussion focus on the indices TN10p-DJF and TX90p-JJA. These represent the coldest and warmest events, namely those with highest impact on health, economy and society. After checking the bias of the seasonal averages (e.g. TNavg-DJF and TXavg-JJA), for each grid-point the indices TN10p and TX90p have been calculated on a seasonal level. In all cases the percentile thresholds have been calculated over the 1981-2010 period, making use of the bootstrapping approach introduced by Zhang et al. (2005).

In order to perform a grid-point by grid-point comparison, the E-OBS indices have been regridded with a bilinear procedure to the native grid of each model (with the exception of the "substitute" grid used for EC-Earth3, see Table 1), creating six versions of remapped E-OBS for each index.
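Bilinear remapping to a target grid point can be sketched as below. This is only an illustration of the interpolation itself, assuming regular ascending axes and no missing values; the actual regridding software is not specified here, and the function name `bilinear` is hypothetical.

```python
from bisect import bisect_right


def bilinear(lons, lats, field, lon, lat):
    """Bilinear interpolation of field[j][i] (at lats[j], lons[i]) to (lon, lat).

    Axes must be ascending with at least two points each. Points outside
    the source grid are linearly extrapolated from the nearest cell.
    """
    # Index of the source cell whose lower-left corner is (lons[i], lats[j]).
    i = min(max(bisect_right(lons, lon) - 1, 0), len(lons) - 2)
    j = min(max(bisect_right(lats, lat) - 1, 0), len(lats) - 2)
    tx = (lon - lons[i]) / (lons[i + 1] - lons[i])
    ty = (lat - lats[j]) / (lats[j + 1] - lats[j])
    bottom = field[j][i] * (1 - tx) + field[j][i + 1] * tx
    top = field[j + 1][i] * (1 - tx) + field[j + 1][i + 1] * tx
    return bottom * (1 - ty) + top * ty


def regrid(src_lons, src_lats, field, dst_lons, dst_lats):
    """Remap a source field onto every point of a regular target grid."""
    return [[bilinear(src_lons, src_lats, field, lon, lat) for lon in dst_lons]
            for lat in dst_lats]


# Illustrative values only: a 2 x 2 source cell remapped to a 3 x 3 grid.
remapped = regrid([0.0, 1.0], [0.0, 1.0], [[0.0, 1.0], [2.0, 3.0]],
                  [0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
```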

At this point, for each season, dataset and grid-point, the trends of the indices over the 1970–2014 period have been obtained. Trends have been calculated with Sen's slope method (Sen 1968), which is more robust than a least-squares approach and does not require the assumption of a normal distribution (Sen 1968; Alexander et al. 2006; ETCCDI 2009). Some model experiments have been run in ensemble mode; in order to obtain the ensemble means, for each model the trends of the ensemble members have been averaged at each grid-point. Each ensemble mean has been compared to the corresponding regridded E-OBS dataset, taking the difference of the trends at each grid-point. The difference has been considered significant when the 95% interval of the E-OBS trend and the 95% interval of the corresponding model trend do not overlap. This process, applied to both high and low resolution versions, has allowed us to detect areas in which the models underestimate or overestimate the trends seen in the observational dataset.
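The trend estimate and the overlap test described above can be sketched as follows. The slope is the median of all pairwise slopes (Sen 1968). The way the 95% interval is obtained is not detailed in the text, so the sketch assumes the standard rank-based interval derived from the variance of Kendall's S without tie correction; the names `sen_slope` and `intervals_overlap` are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def sen_slope(years, values, z=1.96):
    """Return (slope, lower, upper): Sen's slope and an approximate interval.

    z = 1.96 corresponds to a two-sided 95% interval. The interval uses the
    normal approximation for Kendall's S and ignores ties (an assumption).
    """
    n = len(years)
    slopes = sorted((values[j] - values[i]) / (years[j] - years[i])
                    for i, j in combinations(range(n), 2))
    n_slopes = len(slopes)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    c = z * sqrt(var_s)
    # Ranks of the interval bounds within the sorted pairwise slopes.
    lower_idx = max(int(round((n_slopes - c) / 2.0)) - 1, 0)
    upper_idx = min(int(round((n_slopes + c) / 2.0)), n_slopes - 1)
    return median(slopes), slopes[lower_idx], slopes[upper_idx]


def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Illustrative values only: an "observed" and a "simulated" index series.
years = list(range(1970, 1980))
obs_slope, obs_lo, obs_hi = sen_slope(years, [2.0 * k for k in range(10)])
mod_slope, mod_lo, mod_hi = sen_slope(years, [0.0] * 10)
significant = not intervals_overlap((obs_lo, obs_hi), (mod_lo, mod_hi))
```

For a model run in ensemble mode, the same function would be applied to each member and the resulting slopes averaged at each grid-point before taking the difference with E-OBS.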

Finally, the absolute trend bias is defined as the unsigned difference between the trend in the model and the trend in the E-OBS dataset. This quantity has been calculated for the HR and the corresponding LR version of each model in order to compare them. For this purpose, a new temporary dataset has been created using the LR values on the grid of the HR model (LRtoHR). Each LRtoHR grid-point has been filled with the absolute trend bias of the LR grid-point that overlaps with it. This is done in order to better inspect the local impact of increasing resolution, which would be lost if the comparison were performed by regridding the HR to the LR. The HR and the LRtoHR absolute trend biases have been compared by taking the difference as shown in Eq. 1.

$$ \begin{aligned} \mathrm{abstrendbias}_{HR} - \mathrm{abstrendbias}_{LR} = \left| \mathrm{trend}_{HR} - \mathrm{trend}_{\text{E-OBS}} \right| - \left| \mathrm{trend}_{LR} - \mathrm{trend}_{\text{E-OBS}} \right| \end{aligned} $$
(1)

If this metric produces a negative value, then the HR absolute trend bias is lower than that of LR, thus the trend for HR is closer to the observed one, indicating a better performance. On the other hand, a positive value indicates that the performance of HR is worse than that of the corresponding LR. The aim of using absolute trend biases is to assess whether the HR trends are closer to the E-OBS trends than the corresponding LR trends, independently of the sign of the trend difference. If the comparison were performed with signed trend biases, it would only indicate whether the HR models simulate warmer (positive result) or colder (negative result) trends than the corresponding LR models, see Eq. 2.

$$ \begin{aligned} \mathrm{trendbias}_{HR} - \mathrm{trendbias}_{LR} &= \left( \mathrm{trend}_{HR} - \mathrm{trend}_{\text{E-OBS}} \right) - \left( \mathrm{trend}_{LR} - \mathrm{trend}_{\text{E-OBS}} \right) \\ &= \mathrm{trend}_{HR} - \mathrm{trend}_{LR} \end{aligned} $$
(2)
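The comparison of Eq. 1, including the LRtoHR step, can be sketched as follows. The sketch assumes that the HR grid is an exact integer refinement of the LR grid (for example 0.25° against 0.5°, a factor of 2), so that the overlapping LR cell is found by simple repetition; the function names are hypothetical.

```python
def abs_trend_bias(model_trend, obs_trend):
    """Unsigned difference between model and observed trends, cell by cell."""
    return [[abs(m - o) for m, o in zip(model_row, obs_row)]
            for model_row, obs_row in zip(model_trend, obs_trend)]


def lr_to_hr(lr_field, factor):
    """Fill an HR-sized grid with the value of the overlapping LR cell.

    No interpolation is applied: each LR value is repeated factor x factor
    times (assumes the HR grid nests exactly in the LR grid).
    """
    hr_field = []
    for row in lr_field:
        expanded = [value for value in row for _ in range(factor)]
        hr_field.extend(list(expanded) for _ in range(factor))
    return hr_field


def abs_bias_difference(trend_hr, obs_on_hr, trend_lr, obs_on_lr, factor):
    """Eq. 1 on the HR grid: negative values mean HR is closer to E-OBS."""
    bias_hr = abs_trend_bias(trend_hr, obs_on_hr)
    bias_lr_on_hr = lr_to_hr(abs_trend_bias(trend_lr, obs_on_lr), factor)
    return [[h - l for h, l in zip(hr_row, lr_row)]
            for hr_row, lr_row in zip(bias_hr, bias_lr_on_hr)]


# Illustrative values only: one LR cell covering a 2 x 2 block of HR cells.
difference = abs_bias_difference(
    trend_hr=[[2.0, 2.5], [0.0, 2.0]],
    obs_on_hr=[[2.0, 2.0], [2.0, 2.0]],
    trend_lr=[[1.0]],
    obs_on_lr=[[2.0]],
    factor=2,
)
```

In this example the HR version improves on the LR one in three of the four HR cells (negative values) and worsens in one (positive value), a local detail that would be averaged out if the HR field were aggregated to the LR grid instead.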

3 Results

3.1 Minimum temperatures

3.1.1 Bias in winter averages

The considered HR models show strong differences in the reproduction of TNavg-DJF, see Fig. 1. The largest mean biases on continental and regional levels are found for CMCC (+2.96 °C), while EC-Earth3, ECMWF and HadGEM3 underestimate the minimum temperature in almost all regions. For all models West presents the largest (or least negative) biases, while within North there is a clear contrast between the Norwegian coast and the interior of Sweden and Northern Finland. Strong cold biases are observed over Norway and Italy and, less pronounced, in the Balkans and the surrounding regions. MPI and CNRM perform best in terms of mean biases and present a considerably smaller extent of the shaded area. Shading marks grid-points where the simulated TNavg-DJF is significantly different from the observed one (i.e. the 95% confidence intervals of the two terms of the difference do not overlap).

3.1.2 Trends in winter averages

Trends of TNavg-DJF in the models over the 1970–2014 period are compared against the same index in E-OBS. All models reproduce the trends in winter TN very well. The mean trend biases (Fig. 2) range between − 0.16 °C per decade (°C/dec) (CNRM) and +0.02 °C/dec (ECMWF). This indicates a tendency to simulate lower trends over the continent, especially in East and North, which always show negative biases. Nevertheless, recurring positive biases are found over the Kola Peninsula (North Western Russia, 6 models out of 6), as well as over Iberia, Southern and Central Europe, which present warm biases for all models.

3.1.3 Trends in cold extremes

Trends in winter cold extremes, such as TN10p-DJF, are more challenging. Figure 3 shows the performance of the HR version of the models compared to E-OBS (simple difference). While HadGEM3 (mean trend bias: − 1.25 %/dec) and, less strongly, ECMWF (mean trend bias: − 0.95 %/dec) simulate a lower (thus warmer than observed) trend in the number of days below the 10th percentile in all the regions, CMCC, CNRM and EC-Earth present a contrast between Eastern Europe and the other regions. Average biases in Iberia, South and Center are negative and below continental averages for all models but MPI. Furthermore, a considerable number of grid-points with significant differences are detected, in particular for CMCC, indicating a poor representation of the trends in these areas. The performance of MPI differs from the other models, not showing pronounced patterns, with the exception of a too strong warming in TN10p-DJF over Sweden and Norway, in common with three other models.

Possible inconsistencies between the biases in the trends of the average and of the extreme indices allow us to inspect how the models reproduce the changes in the shape of the temperature distribution. In the case of minimum winter temperatures, general agreement (excluding MPI) is observed in the overestimation of the trends for Iberia, South and Center. Nevertheless, for models like CMCC and HadGEM3 the presence of significant biases in the TN10p-DJF trends over Iberia and other areas implies that they simulate considerably fewer cold extreme events than observed. This reveals a tendency to simulate a narrower distribution. The same consideration can be made about Southern and Northern Europe (excluding Finland), where the good representation of the trends of the average contrasts with the general warmer bias in the trends of the extremes. Finally, Eastern Europe presents colder simulated trends for both average and extreme values in four models, while ECMWF and HadGEM3 show disagreement, indicating, again, a tendency to a narrower distribution than observed.

The patterns in the trend bias in the HR models can be compared to the low resolution models, whose results are shown in Appendix C. Figure 9 shows that the LR models present similar patterns in the trend biases as the HR models.

Figure 4 presents the difference in absolute trend biases of TN10p-DJF between HR and LR, see Sect. 2.3. Negative values (green) indicate that HR has a lower absolute trend bias than LR for that specific grid-point, thus it is performing better. Only CMCC clearly shows an area with worse absolute trend biases over West, North and Center (where very large trends are simulated), which contrasts with the strong performance of the same model over Eastern Europe. Despite this, the mean absolute trend biases over the whole continent are reduced for almost all the models, indicating a general improvement in the description of the cold extremes between low and high resolution. The largest improvement is found for HadGEM3 (− 0.51 %/dec, especially over Central Europe, − 1.12 %/dec), while the only worsening, among the considered models, is for ECMWF (+ 0.17 %/dec), whose LR version is found to perform best among the LR models, see Appendix C. The model with the lowest mean absolute trend bias in high resolution is MPI (0.61 %/dec).

Fig. 1

Difference in the winter average of TN between the HR models and E-OBS. Red indicates overestimation, blue indicates underestimation. Significant differences are indicated by small thin circles for each grid-point, which result in shaded areas

Fig. 2

Difference in the trends of TNavg-DJF for HR models. Red indicates overestimation (warmer simulated trends), blue indicates underestimation (colder simulated trends). Significant differences are indicated by black circles

Fig. 3

Difference in trends of TN10p-DJF between the HR models and E-OBS. Red (blue) indicates an underestimation (overestimation) of the trend, related to a warmer (colder) trend in the model

Fig. 4

Difference in absolute trend bias of TN10p-DJF between HR and LR models. Red (green) pixels indicate an increase (decrease) of the absolute trend bias, thus a worse (better) performance of HR

3.2 Maximum temperatures

3.2.1 Bias in summer averages

The comparison of summer maximum temperatures has started from an evaluation of the biases of the models. Four of them give a mean bias that has a lower absolute value compared to what is observed for summer maximum temperatures, see Fig. 5. CMCC presents a large underestimation (− 3.83 °C), similar, but with opposite sign, to the corresponding result for TN. The remaining 5 models show similar patterns, with a warm bias along the Mediterranean and Black Sea coast and a general underestimation over North and Center, together with the Northern part of East. A large overestimation common to all models is found in Northern African regions, influencing the mean bias over Iberia, which has lower values on the Atlantic Coast. Nevertheless these large biases (above +10 °C) can be in part related to the high uncertainty of E-OBS over Morocco and Algeria, due to a lower station density.

3.2.2 Trends in summer averages

The continental average of the difference in trends of TXavg-JJA ranges between − 0.17 °C/dec (ECMWF) and + 0.03 °C/dec (CMCC), see Fig. 6. The models tend to slightly underestimate the warming of summer temperatures. This is more evident over Iberia, South and East, whose regional mean biases are always similar or lower than the European mean. Especially in the case of EC-Earth and ECMWF, large areas with significant difference are observed, implying an inaccurate reproduction of the changes in the climate of these areas. On the other hand almost all the models tend to overestimate the trends over Northern Europe (especially Southern Norway).

3.2.3 Trends in warm extremes

Figure 7 shows the difference in trends between models and observations in TX90p-JJA, related to warm temperature extremes. The European mean biases show a large underestimation of the trends for EC-Earth (− 0.73 %/dec), ECMWF (− 0.59 %/dec) and HadGEM3 (− 0.56 %/dec). In all cases stronger trends, consistent with those found for the trends in the averages, are simulated over Northern Europe, in particular Norway and Sweden. This behaviour is partially observed over Center and West and contrasts with the general underestimation of trends over Iberia, South and East, simulated by all models but CMCC. In these areas large significant differences are found, in particular for EC-Earth3, ECMWF and HadGEM3. This aspect (as found for the simple seasonal averages as well) indicates a tendency to reproduce lower trends of warm extremes in the Mediterranean and Black Sea region and slightly larger ones around the North Sea. In contrast to the other models, CMCC does not present this pattern, showing an overestimation of the trends in all regions with the exception of Eastern Europe. These biases are consistent with those observed for the trends of the average over Center and West, indicating a good representation of the changes in the shape of the distribution. On the other hand, over North, South and Iberia a discrepancy is detected, which reveals that for CMCC the number of warm events increases faster than the average maximum temperature. The simulation of a wider distribution is common with the other models over Northern Europe. This similarity is not found for Iberia, South and East, whose trends in TX90p-JJA are underestimated, with biases that are generally stronger than those obtained for the average values. This indicates that these models, and especially HadGEM3, simulate a tendency to a narrower distribution of summer maximum temperature over areas, like the Mediterranean and Eastern Europe, where the warm extremes reach concerning values.

Figure 8, showing the difference in absolute trend biases between the HR and LR model configurations, does not show a common pattern. The largest improvement in moving from LR to HR is found for MPI (− 0.16 %/dec), in particular for Central and Eastern Europe. At the same time HadGEM3 presents the most intense worsening (+0.25 %/dec), with larger increases of the mean absolute bias over Iberia and Southern Europe. These findings indicate that, for most of the models, the reproduction of trends of warm extremes over Europe has not considerably improved with the high resolution versions.

Fig. 5

Same as Fig. 1 but for TX-JJA

Fig. 6

Same as Fig. 2 but for TX-JJA

Fig. 7

Difference in trends of TX90p-JJA between the considered models and E-OBS. Red (blue) indicates an overestimation (underestimation) of the trend, related to a warmer (colder) trend

Fig. 8

Same as Fig. 4 but for TX90p-JJA

4 Summary and conclusions

Six models, in their high (HR) and low resolution (LR) versions, have been compared (over the 1970–2014 period) to E-OBS.hom, a version of the gridded dataset E-OBS based on homogenized daily series (each covering at least 1970–2014) of observed temperatures. The analysis has been performed first on the biases of the seasonal averages and of their trends, focusing on winter minimum temperatures (TNavg-DJF) and summer maximum temperatures (TXavg-JJA), and then on two indices defined by ETCCDI (ETCCDI 2009). These are the percentage of days with minimum temperatures below the 10th percentile of winter values ('cold nights', TN10p-DJF) and the percentage of days with maximum temperatures exceeding the 90th percentile of summer values ('warm day-times', TX90p-JJA). The percentile thresholds have been calculated using the 1981–2010 period. After the calculation of the trends of the considered indices, for those models with more than one ensemble member (see Table 1), the ensemble mean has been calculated. For each grid-point, average values and trends in the models have been compared to observations and an assessment has been made of the difference between the HR and LR model versions. The results have been aggregated over six regions: Iberia, South, East, West, Center and North, see Figs. 1, 2, 3, 4, 5, 6 and 7. These areas have been chosen to highlight recurrent behaviours and to allow a thorough analysis of the model performance in different geographical and climatic contexts.

For both winter-mean TN and summer-mean TX strong biases have been found in the simulations, with the strongest ones in all regions for CMCC. This model shows a mean bias of + 2.96 °C for TNavg-DJF (up to + 4.96 °C in South) and of − 3.83 °C for TXavg-JJA (down to − 4.89 °C in North), indicating an underestimation of the amplitude of the seasonal cycle all over the continent.

In contrast, the other models present smaller biases in the continental average, with regional anomalies. In particular, the biases of maximum summer temperature show a common North-South gradient, with warmer values along the European coasts of the Mediterranean and the Black Sea (up to +4.06 °C over Iberia for MPI). This may be related to excessive moisture in Northern Europe and a lack of moisture in the Southern sector (Seneviratne et al. 2006; van Oldenborgh et al. 2009; Lorenz et al. 2010). Furthermore, a recent work (Boé et al. 2020) connects larger simulated temperatures to lower evapotranspiration over the sea. Such a phenomenon has direct effects on specific and relative humidity, implying a reduction in cloud cover and in precipitation, and thus in soil moisture.

At the same time, the simulated trends of TXavg-JJA overestimate those observed on Northern and, less often, Western Europe. This result differs from the underestimation of the trends over Southern Europe (e.g. − 0.33 °C/dec for ECMWF, which reduces by a half the observed trend of + 0.66 °C/dec, see Table 4). This pattern confirms the findings of Bhend and Whetton (2013), that detected an underestimation of trends of average summer TX over Europe in CMIP5.

Such a gradient is found, with larger intensity, in the analysis of the biases in the trends of extreme summer maximum temperatures. In particular three models (EC-Earth, ECMWF and HadGEM3) strongly underestimate the increase of days above the 90th percentile. The observed trend over Southern Europe (+2.92 %/dec, see Table 4) is roughly halved by these models, whose biases are respectively − 1.35 %/dec, − 1.64 %/dec and − 1.83 %/dec, see Table 6. A similar anomaly is found for ECMWF in spring: − 1.14 %/dec, more than half of the observed + 1.81 %/dec. Nevertheless, these are not isolated cases: the other models (except CMCC) generally underestimate the trends of TX90p over Southern and Eastern Europe for all seasons. This reveals on one side the simulation of narrower seasonal temperature distributions than observed and, more strikingly, a lack of simulated warm events over an area which is extremely sensitive to the increased strength and length of heatwaves (Della-Marta et al. 2007; Simolo et al. 2010; Squintu et al. 2019).

In Southern Europe and Iberia, the combination of an excessively large positive (warm) bias in summer maximum temperatures with a too weak increase in the seasonal average and with a much weaker (compared to observations) increase in the extreme indices points to issues in the representation of soil moisture in the models. In a climate which is too warm the soil can be expected to lack more moisture than in cooler conditions due to enhanced evaporation. Once the soil is dry the radiation balance is shifted to a state where sensible heat is dominant over latent heat. Under boundary conditions where the incoming energy flux rises (due to the increase of greenhouse gases), this implies a further increase in sensible heat and surface warming. Nevertheless, in conditions of moist soil the simulated warming trend in temperatures would be even stronger, due to the shift from latent to sensible heat, thus getting closer to the observed conditions (Seneviratne et al. 2006; van Oldenborgh et al. 2009; Lorenz et al. 2010; Min et al. 2013).

When considering the results for winter minimum temperatures, common patterns are found among the models. The most evident one is the underestimation of the average winter minimum temperatures over Italy and Norway. This is likely to be related to a poor representation of winter minimum temperatures in mountain areas such as the Alps, the Apennines and the Scandinavian Mountains. At the same time, winter minimum temperatures are overestimated in the plain regions of the north of Sweden and Finland; this is probably connected to a lack of snow coverage simulated by the models, as suggested by van Oldenborgh et al. (2009) and Diffenbaugh et al. (2013).

All the models present an overestimation of the trends in average winter TN over Southern and Central Europe and an underestimation over Eastern Europe (excluding the Kola Peninsula), see Fig. 2. These cold biases in the trends (− 0.41 °C for CMCC and CNRM) have almost the same amplitude as the observed trend (0.51 °C, Table 4), indicating that these models simulate a very weak increase of winter TN in these regions. This might be linked to an underestimation of the decreasing trend of snow coverage compared to the observations (van Oldenborgh et al. 2009). Furthermore, a recent study (Dai et al. 2019) has found a close connection between sea-ice loss and warmer trends in winter temperatures at high latitudes, the Arctic Amplification. The values in Table 4 confirm that Eastern and Northern Europe have experienced the largest warming trends in winter average and extreme values. Hence, a poor simulation of the Arctic Amplification may be the reason for the colder simulated trends in the areas around the Baltic Sea and the Arctic Ocean (ignoring the Kola issue).

The most relevant anomaly for the trends in the extremes is found for the warm biases over Southern Europe (− 1.88%/dec for HadGEM3), Central Europe (− 2.29%/dec for CMCC) and Iberia (− 2.29%/dec for CMCC). In these regions the observed trends range between − 0.56%/dec and + 0.15%/dec (Table 4). Thus, the model trends of the cold extremes are significantly stronger than the observed ones, revealing an excessive narrowing tendency in the distribution of winter minimum temperatures. This anomaly is not detected in the other seasons; only Central Europe presents warmer simulated trends in almost all cases.

The too warm simulated trends over the Kola Peninsula are found for TNavg-DJF and TN10p-DJF (as an underestimation of the number of days below the 10th percentile) and are related to E-OBS station density issues. The only series with observed values in the area starting before 1970 (Krasnoshelye) has missing data between 1972 and 1980. In the case of TN, the interpolation of data from surrounding stations leads to higher values in 1972–1980 than in the following years, introducing a too cold trend that does not take place in the models. This behaviour, limited to only one series, motivates the ECA&D group to work on further data collection and on increasing the station density in this and other areas. This will improve the quality of the interpolation and avoid such issues.

The combination of the results for TN and TX indicates that around the Mediterranean and in Central Europe the trends in the percentage of events below the 10th percentile and above the 90th percentile are underestimated for a relevant number of regions and models. This implies that the tails of the distribution are simulated to get closer to each other faster than observed. Thus, in these areas the distribution of simulated daily temperatures becomes narrower than the distribution of observed daily temperatures, underestimating the intensity of the extremes, especially the warm ones. As mentioned above, several issues in the models can be the reasons behind this (soil moisture, sea surface evapotranspiration). The diagnosis of such problems is beyond the scope of this paper, which only aims at identifying possible issues in the simulations.
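The percentile-based indices discussed here can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the procedure used in this study: a single fixed threshold is taken from a base period (operational definitions of such indices typically use calendar-day thresholds), the trend is an ordinary least-squares slope, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def exceedance_percent(daily, threshold, above=True):
    """Percentage of days per year beyond a threshold.

    daily: array of shape (n_years, n_days_in_season).
    above=True counts warm days (TX90p-like), False counts cold days (TN10p-like).
    """
    hits = daily > threshold if above else daily < threshold
    return 100.0 * hits.mean(axis=1)


def trend_per_decade(years, values):
    """Least-squares linear trend, converted from per year to per decade."""
    slope = np.polyfit(years, values, 1)[0]
    return 10.0 * slope


# Synthetic example: 45 summers of 92 days with an imposed warming of 0.03 °C/yr.
rng = np.random.default_rng(0)
years = np.arange(1970, 2015)
daily_tx = rng.normal(25.0 + 0.03 * (years - 1970)[:, None], 3.0,
                      size=(years.size, 92))

# Assumption: the threshold is the 90th percentile of the first 30 years.
threshold = np.percentile(daily_tx[:30], 90)
tx90p = exceedance_percent(daily_tx, threshold, above=True)
print(f"TX90p trend: {trend_per_decade(years, tx90p):+.2f} %/dec")
```

Applying the same two functions to observed and simulated series, and subtracting the resulting trends, gives a trend bias in %/dec of the kind discussed in this section.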

As a last step, the analysis of the evolution of the absolute trend bias in the models from LR to HR does not show a general improvement. Each model presents different patterns and diverse behaviour in terms of change of mean absolute trend bias. Nevertheless, this index decreases for TN10p-DJF in almost all models (except ECMWF), indicating a clearer improvement than for TX90p-JJA, where only three models slightly improve and the others worsen by up to + 0.25%/dec.
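The LR-to-HR comparison can be sketched as follows. This is a minimal illustration that assumes the mean absolute trend bias is an unweighted average of |simulated − observed| over gridpoints with valid observations; the study may use a different (for example area-weighted) average, and the function name and toy values are hypothetical.

```python
import numpy as np


def mean_absolute_trend_bias(sim_trend, obs_trend):
    """Mean of |simulated - observed| trend over gridpoints, ignoring NaNs."""
    bias = np.asarray(sim_trend, dtype=float) - np.asarray(obs_trend, dtype=float)
    return float(np.nanmean(np.abs(bias)))


# Toy 2x2 trend fields in %/dec; NaN marks a gridpoint without observations.
obs = np.array([[1.0, 2.0], [np.nan, 0.5]])
low_res = np.array([[0.5, 2.5], [1.0, 1.5]])
high_res = np.array([[0.8, 2.1], [1.0, 1.0]])

# Negative change means the HR version is closer to the observed trends.
change = (mean_absolute_trend_bias(high_res, obs)
          - mean_absolute_trend_bias(low_res, obs))
print(f"Change in mean absolute trend bias (HR - LR): {change:+.2f} %/dec")
```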

Finally, it appears that the new high-resolution models, even though they do not significantly increase or decrease their absolute bias in the trends of the extremes, still have some problems, especially over the Mediterranean area. In this region the most serious discrepancy with observations is the large underestimation of the increasing trends of warm extremes. Considering the high economic and societal vulnerability of these areas to very warm events in summer and the importance of the prediction of heatwave intensity and frequency for the next decades, it is fundamental to improve the simulation of these phenomena and of their projections in future decades.

It is not the purpose of this work to pick the best model or to uphold the reliability of the models in their simulations or in their projections of the future. Nevertheless, for most of the models the number of gridpoints that present a significant difference between simulated and observed trends is relatively low, even though not ideal. This allows us to affirm that, notwithstanding relevant biases in the seasonal averages, the trends in the average values are trustworthy and the trends in the extremes can roughly describe the general tendencies. Still, the need for serious improvements in the simulation of temperature variability and of its consequences on extreme events is clear.
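As an illustration of how a count of gridpoints with significantly different trends could be obtained, the sketch below uses a simple two-sided z-test on the difference between two trend estimates. This is an assumption made for the example only: the test actually applied in this study is not restated in this section, the two estimates are treated as independent, and all names and values are hypothetical.

```python
import numpy as np


def fraction_significant(sim_trend, obs_trend, sim_se, obs_se, z_crit=1.96):
    """Fraction of gridpoints where the trends differ at roughly the 5% level.

    Assumes independent, approximately Gaussian trend estimates with known
    standard errors (sim_se, obs_se); NaN gridpoints are excluded.
    """
    sim_trend, obs_trend = np.asarray(sim_trend, float), np.asarray(obs_trend, float)
    sim_se, obs_se = np.asarray(sim_se, float), np.asarray(obs_se, float)
    z = (sim_trend - obs_trend) / np.sqrt(sim_se**2 + obs_se**2)
    valid = np.isfinite(z)
    return float(np.mean(np.abs(z[valid]) > z_crit))


# Toy example with two gridpoints: only the first difference is significant.
frac = fraction_significant(sim_trend=[1.0, 1.0], obs_trend=[0.0, 0.9],
                            sim_se=[0.3, 0.3], obs_se=[0.3, 0.3])
print(f"Fraction of gridpoints with significant trend difference: {frac:.2f}")
```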