Euro-Atlantic weather Regimes in the PRIMAVERA coupled climate simulations: impact of resolution and mean state biases on model performance

Recently, much attention has been devoted to better understand the internal modes of variability of the climate system. This is particularly important in mid-latitude regions like the North-Atlantic, which is characterized by a large natural variability and is intrinsically difficult to predict. A suitable framework for studying the modes of variability of the atmospheric circulation is to look for recurrent patterns, commonly referred to as Weather Regimes. Each regime is characterized by a specific large-scale atmospheric circulation pattern, thus influencing regional weather and extremes over Europe. The focus of the present paper is the study of the Euro-Atlantic wintertime Weather Regimes in the climate models participating to the PRIMAVERA project. We analyse here the set of coupled historical simulations (hist-1950), which have been performed both at standard and increased resolution, following the HighresMIP protocol. The models’ performance in reproducing the observed Weather Regimes is assessed in terms of different metrics, focussing on systematic biases and on the impact of resolution. We also analyse the connection of the Weather Regimes with the Jet Stream latitude and blocking frequency over the North-Atlantic sector. We find that—for most models—the regime patterns are better represented in the higher resolution version, for all regimes but the NAO-. On the other side, no clear impact of resolution is seen on the regime frequency of occurrence and persistence. Also, for most models, the regimes tend to be more tightly clustered in the increased resolution simulations, more closely resembling the observed ones. However, the horizontal resolution is not the only factor determining the model performance, and we find some evidence that biases in the SSTs and mean geopotential field might also play a role.


Introduction
Due to its large natural variability, the North Atlantic-European climate is one of the most difficult to predict. The peculiarity of the North Atlantic is clearly seen in climatological maps of the wintertime 500 hPa geopotential height variability, which shows two maxima in this region. The first maximum is found at high frequency-less than 5 days period-along the northeastern coast of North America, where the eddy-driven jet stream enters the Atlantic ocean. The second maximum is located in the northern part of the Atlantic Ocean, south of Iceland and northwest of the British Isles, in the center of action of the North Atlantic Oscillation (NAO) and is characterized by longer periods (5 days to a few weeks) (Blackmon et al. 1984).

Electronic supplementary material
The online version of this article (https ://doi.org/10.1007/s0038 2-020-05271 -w) contains supplementary material, which is available to authorized users.
Weather regimes have been frequently studied in a nonlinear dynamical system perspective. The clear tendency for multimodality in the pdf of the observed field-which departs significantly from a multinormal distribution (Straus et al. 2007Dawson et al. 2012;Corti et al. 1999)-suggests to consider the regimes as real attractors of the chaotic climate system. This hypothesis has been widely studied in simplified models (Christensen et al. 2015;Hannachi et al. 2017) and extended by analogy to complex GCMs (Palmer 1999;Corti et al. 1999).
An alternative perspective to Weather Regimes uses the latitudinal position and maximum speed of the eddy-driven Jet Stream. Woollings et al. (2010) showed that the observed distribution of the jet stream latitude is multi-modal and there are three preferred locations corresponding to different geopotential configurations. A southern shift in the jet stream is linked to a negative phase of the NAO and an anticyclonic anomaly over Greenland, while central and northern jets correspond to the positive and negative East Atlantic patterns (EA), closely related to NAO + and AR regimes (Woollings et al. 2010;Franzke et al. 2011;Madonna et al. 2017). In this framework, the SBL regime covers a wide variety of jet latitudes and is referred to as mixed jet state. As for the jet speed, systematically smaller values are found in the third sector of the NAO-EA phase space (i.e. negative NAO and negative EA), corresponding to part of the SBL and NAO-regimes. Large variability of the jet speed is found in the remaining three sectors of the NAO-EA space. An important step towards reconciling the Jet Stream latitude and Weather Regimes perspectives has been done by Madonna et al. (2017), who performed cluster analysis of the two dimensional lower tropospheric zonal wind patterns over the North Atlantic. Three out of the four "jet regimes" obtained in this way correspond to the three peaks in the one-dimensional jet latitude distribution, and the fourth (linked to the SBL regime) consists of a low or central jet in the western Atlantic that abruptly shifts northward due to the blocking high over Scandinavia. They additionally show how these "jet regimes" well match the usual weather regimes calculated from geopotential height, so building a consistent dynamical picture of the modes of variability over the Euro-Atlantic sector.
Another common and closely related analysis consists in studying the modifications of the unperturbed westerly flow in terms of blocking episodes, that is the presence of a high geopotential height anomaly that forces the jet to shift or to weaken and meander. The climatological frequency of blocked days shows two distinct maxima over northwestern Europe and Greenland, which have their counterparts in the SBL and NAO-Weather Regimes pattern (Scherrer et al. 2006;Davini et al. 2012Davini et al. , 2017Madonna et al. 2017). In this sense, out of the 4 weather regimes described above, we have a single unblocked regime (NAO+) and three "blocked" regimes (SBL, AR, NAO-), although the correspondence might not be completely consistent.
Climate models. Global climate models have improved much in the representation of the large-scale circulation in the last decades, primarily as a consequence of new parameterization schemes and larger computational resources that have allowed to reach resolutions on the order of 100 km and below (Haarsma et al. 2016;Roberts et al. 2018b). However, many problems are still to be solved and the skill in reproducing some features of the observed climate will hopefully improve in future models. Focusing on the North-Atlantic, models are known to struggle to reproduce the observed jet stream variability and blocking frequency. Most CMIP5 models show no sign of multi-modality in the jet latitude pdf (Hannachi et al. 2013;Anstey et al. 2013;Iqbal et al. 1 3 2018) and have a well-known negative bias in the climatological frequency of blocking events over Europe (Anstey et al. 2013;Davini and D'Andrea 2016;Masato et al. 2013;Schiemann et al. 2017).
Also, some studies have focused on the performance of complex GCMs in correctly reproducing the observed Weather Regimes. An example in this regard is the article by Dawson et al. (2012): a low (spectral resolution T159, roughly 125 km grid spacing) and a high resolution version (T1279, roughly 16 km spacing) of the ECMWF model were compared for their skill in reproducing the regimes patterns, their frequency of occurrence, persistence and preferred transitions. The low-resolution version was found unable to correctly reproduce the SBL and AR regimes and produced too many short NAO events. The high-resolution model showed significant improvements, hinting that an increase in resolution is needed to give a reliable representation of North-Atlantic variability. Dawson and Palmer (2015) showed that some improvement is already obtained with an intermediate resolution (T511). A similar analysis has been performed by Cattiaux et al. (2013) on the CMIP3 and CMIP5 versions of the IPSL model at different resolutions. The regimes patterns were found to be better reproduced with the CMIP5 version, which included more vertical levels, and with increased resolution. In terms of frequencies, they found an overestimation of the NAO+ and SBL occurrence, which slightly improved with resolution. More recently, Strommen et al. (2019) analyzed the standard and high resolution versions of three climate models run in atmospheric-only mode (AMIP) and found no systematic improvements in the regime patterns with resolution. However, both the sharpness of the regime structure (see Sect. 3.2) and the persistence of the SBL regime improved with increased resolution.
WRs and climate change. In recent years, there has been increasing interest in studying WRs and how well climate models reproduce them, due to their importance in influencing regional weather patterns and extremes and possibly future regional changes in the climate state (Corti et al. 1999;Matsueda and Palmer 2018). In fact, much attention has been given recently not only to the future changes in the mean state of the general circulation (Hoskins and Woollings 2015), but also to changes in the modes of variability. The interest in the reproduction of WRs in GCMs and in their response to climate change is motivated by different reasons. From a dynamical systems perspective, it has been hypothesized that the first response of the system to a moderate external forcing would manifest in the frequencies of occurrence of the different modes of variability, whilst the variability patterns will remain initially unchanged (Palmer 1999;Corti et al. 1999). From a more pragmatic point of view, WRs are closely connected with regional weather types and weather extremes and constitute a clear and useful framework to study the impacts of climate change in key mid-latitude regions, like Europe and the Mediterranean (Plaut and Simonnet 2001;Yiou and Nogaj 2004;Zampieri et al. 2017;Raymond et al. 2018) or the Atlantic coast of North-America (Roller et al. 2016). Recently, the WRs framework has also been used to assess the variability of wind/solar energy production potential in Europe .
In this paper we report on a multi-model assessment of various diagnostics related to WRs, considering a set of coupled historical simulations using different state-ofthe-art climate models at two resolutions. The main goal of the paper is to investigate the typical model biases in reproducing the observed WRs and their dynamics. Specifically, we assess whether increasing the model resolution produces a significant improvement in the diagnostics in a multi-model ensemble, following on from previous results for individual models (Dawson and Palmer 2015) and for AMIP simulations (Strommen et al. 2019). The paper is structured as follows: Sect. 2 regards the data used in the analysis and describes the models and their configurations; Sect. 3 describes the procedure used to calculate the Weather Regimes and the related diagnostics; Sect. 4 contains the main results of the paper, regarding the model performances and the improvements with increased resolution; in Sect. 5 we analyse some possible drivers of the model performance; Sect. 6 is dedicated to the final discussion and conclusions.

Data: models and observations
The coupled models considered in this work are those participating in the European H2020 PRIMAVERA project: CMCC-CM2 (Cherchi et (Sein et al. 2017). All models follow the HighResMIP protocol (Haarsma et al. 2016), at nominal resolutions ranging from 250 to 25 km. For each model, two versions are available, one at standard (low-res, LR) and one at higher resolution (high-res, HR); some model produced additional intermediate resolutions as well. This is done following the philosophy of the project, which aims to assess the improvements in the representation of the observed climate due only to the increased model resolution. Therefore, the models have been tuned in their low-res version, and the high-res version is obtained by just increasing the resolution, with no additional tuning.
In Table 1, all models are listed with their basic characteristics, their effective resolutions in both the atmosphere and ocean components, and the number of ensemble members analysed for each resolution. Note that most models increased the resolution of both the atmosphere and the ocean components, with the exception of CMCC-CM2 and MPI-ESM1, increasing the atmospheric resolution only.
Under the PRIMAVERA project, models are run both in atmosphere-only and coupled mode. Here we consider the coupled historical simulations (hist-1950), covering the range 1950-2014. For each simulation we use the geopotential height daily mean data at 500 hPa, as explained in Sect. 3.1. As a reference, we take geopotential height daily reanalysis data at 500 hPa from ERA40 (1957ERA40 ( -1978 and ERAInterim (1979ERAInterim ( -2014, thus covering the 1957-2014 range.

Method and diagnostics
The following subsections describe the methodology used to calculate the Weather Regimes together with the diagnostics and metrics applied to compare observed and simulated regimes. The work-flow has been implemented in a Python software package named "WRtool", freely available at https ://githu b.com/fedef 17/WRtoo l.

Calculation of the weather Regimes
The calculation of Weather Regimes is performed here through clustering of the geopotential height anomalies at 500 hPa (Z500) in a reduced phase space. This approach is widely used in literature, even if many subtle variations in the actual procedure have been adopted (Hannachi et al. 2017;Straus et al. 2017). Despite these differences, the main results of the analysis are quite robust, as documented by many accurate tests in literature (Straus et al. 2007;Hannachi et al. 2013).
In this work we consider the wintertime (December-February; DJF) daily Z500 fields and proceed through the following steps.
Data pre-treatment and removal of the mean seasonal cycle. The model (and the reanalysis) data are first interpolated to a 2.5 • × 2.5 • grid using bilinear interpolation. This is done mainly for practical reasons related to data-handling, i.e. to have all the data at the same spatial resolution. The interpolation does not affect the results, mainly because we are interested in large scale patterns. We then calculate the mean seasonal cycle and subtract it from the data, to obtain a timeseries of daily geopotential height anomalies. The seasonal cycle is calculated averaging the full 1957-2014 timeseries day-by-day and then performing a running mean of 20 days. Other authors used a running mean of 5 days (Dawson et al. 2012;Strommen et al. 2019) or calculated the timeseries expansion in terms of Legendre polynomials (Straus et al. 2007). We chose the 20 days running mean in order to avoid higher frequency fluctuations due to internal variability. We considered whether to detrend the seasonal cycle, in order to remove the climate change signal on the mean state: the trend in the wintertime 500 hPa geopotential height was found to be comparable to decadal variability in some regions (with a maximum of about 20 m in the southwest part of the domain) and generally smaller elsewhere (not shown). We finally decided not to apply the detrending, because considering shorter (30 years) periods introduces unwanted noise from decadal variability in the entire domain and in the WR attribution.
We select the Euro-Atlantic (EAT) domain (30 • -90 • N, 80 • W-40 • E), which is one of the most used in the literature (Dawson et al. 2012;Madonna et al. 2017;Strommen et al. 2019), although some authors considered slightly different sectors (Cassou 2008;Cattiaux et al. 2013). We checked that Table 1 Models used in the analysis, listed with their components (atmosphere/ocean/ice models), the atmospheric grid used for the two versions (low-and high-res), the nominal resolution and number of levels used for the atmosphere and ocean components, the num-ber of ensemble members analysed for each resolution. Note that for HadGEM-GC31 (LL, MM, HM, HH) and ECMWF-IFS (LR, MR, HR) more than two model versions have been analyzed the observed regimes' patterns are robust to small changes in the choice of the domain (of the order of 10 • ). It is worth noting that the Euro-Atlantic domain used here extends much more eastward (40 • ) than the usual North-Atlantic region adopted in Jet Stream studies (Woollings et al. 2010).
Calculation of the EOFs and projection onto the reduced phase space. We then calculate the Empirical Orthogonal Functions (EOFs) of the anomalies on the Euro-Atlantic domain and define a phase space of reduced dimensionality as that spanned by the first 4 EOFs, representing about 55% of the variance. The choice of the number of EOFs varies in literature from 4 (Dawson et al. 2012) up to 14 (Cassou 2008;Cattiaux et al. 2013) or more. Sensitivity tests have been performed by (Straus et al. 2007) for the Pacific-North American sector and show that the mean regime patterns are quite insensitive to this choice, but the clustering becomes less significant with a larger number of EOFs. We performed similar tests for the EAT domain, finding that regime patterns and cluster attribution are robust to this choice.
We proceed differently at this point for the models and for the reference. Repeating this step for each simulation would produce different EOFs and then different phase spaces, and would constitute an additional factor to take into account in the analysis. For example, it can also happen in some cases that the 4th model EOF is the model counterpart of the 5th EOF in the observations or so, thus heavily penalizing the model performance in some regards.
For these reasons we chose to use the same reduced phase space for all simulations, which is that spanned by the 4 leading EOFs obtained from ERA reanalysis ("phase space" in the following). The daily anomalies are then projected on this space, thus obtaining a timeseries of 4 Components (pseudo-PCs) for each simulation. Strictly speaking, these are not Principal Components, since we are not projecting on the model EOFs. Using the same space for all models allows comparing the regime patterns in a consistent way (for example, computing distances and angles). However, this choice might affect some of the metrics applied to evaluate the regime structure, so we specifically comment on this in Sect. 3.2.
Clustering and Weather Regimes attribution. We apply a K-means clustering algorithm to the models' pseudo-PCs and observation PCs, setting the number of clusters to 4. This number has been found to give the most significant clustering for the wintertime Euro-Atlantic sector (Michelangeli et al. 1995;Cassou 2008;Straus et al. 2017;Hannachi et al. 2017). Each day in the pseudo-PCs timeseries is assigned to a cluster, minimizing the distance in phase space to the cluster centroid. The mean regime patterns are calculated as composites of all the points belonging to the corresponding cluster. Once the model clusters have been calculated, their order might differ from the reference ones. Therefore a best-matching algorithm is applied, which reorders the simulated regimes so as to minimize the total RMS deviation between the observed and simulated patterns.

WR-related model diagnostics
The Weather Regimes are first calculated for the reanalysis dataset and used as a reference for the simulated WRs. The WR patterns obtained here for the ERA reanalysis are shown in Fig. 1. Using a different reanalysis like JRA55 or NCEP produces very similar results (Strommen et al. 2019;Dawson et al. 2012). In line with previous works (Dawson et al. 2012;Cattiaux et al. 2013;Strommen et al. 2019), we evaluate the model performance in reproducing WRs in terms of different metrics: (i) Regime centroid and mean pattern. For each regime, the cluster centroid in phase space is given by the average of all pseudo-PCs assigned to that regime. This corresponds to a mean regime pattern, which is the simplest way to visualize the preferred large scale geostrophic flow configuration corresponding to that regime. A subtle difference exists between the mean pattern calculated in this way and that obtained through the composite of all daily anomalies corresponding to the cluster. The latter considers the anomalies in the full space, instead of only the 4 dimensions given by the reference EOFs. However, since higher order variability is quite uniform across the regimes, the two patterns are almost indistinguishable, and we use the reduced phase space pattern for consistency with the other metrics. (ii) Regime "clouds" in phase space. The comparison of the simulated and observed regime patterns is good to get an idea of the model performance "at a glance". However, there is much more information stored in the statistics of the pseudo-PCs belonging to each regime. For example, the location of the centroid does not provide any information about the shape or the width of the distribution. Therefore, we apply a suitable metric to compare the model regime "clouds" in phase space. For each regime, we calculate the relative entropy of the modeled distribution with respect to the reference one. The relative entropy or Kullback-Leibler divergence (Kullback and Leibler 1951) is a measure of how much the two distributions differ from an information theory perspective: the larger the relative entropy, the further the two distributions are; a value of 0 means that the two distributions are identical. The reference and model distributions of the regime clouds are calculated through a Gaussian kernel approach for each regime separately on a fixed grid, considering the first 3 pseudo-PCs only (for computational reasons). The relative entropy is given by: where P is the reference distribution obtained from ERA, Q is the simulated one and the x i are all points in the 3-dimensional grid. (iii) Variance ratio. Also called optimal variance ratio, this is the ratio between the mean inter-cluster squared distance and the mean intra-cluster variance. The larger the variance ratio, the more clustered are the data, giving compact clusters well apart one from the other. The variance ratios are always below unity for the data we analyze here, meaning that the clusters are somewhat overlapping. Formally, the variance ratio is obtained in the following way: (1) where the first summation is done over all the combinations (without repetition) of the clusters ( C ij ), and are the respective cluster centroids, n is the number of clusters and 2 i is the intra-cluster variance.
(iv) Sharpness. This measure (also called significance in previous works) is closely connected to the variance ratio. A Monte Carlo test is performed on the pseudo-PCs, to assess whether there is evidence of multi-modality and hence non-normality. Following previous works (Straus and Molteni 2004;Straus et al. 2007;Dawson et al. 2012;Dawson and Palmer 2015;Strommen et al. 2019), we first construct 1000 synthetic data series drawn from a multinormal distribution with the same length, lag-1 autocorrelation and variance as the original pseudo-PCs. The sharpness is defined as the percent of synthetic data series that have a variance ratio smaller than the original one (i.e. that are "less clustered"). (v) Regime frequencies. The frequency of occurrence of each regime is calculated simply as the ratio of days belonging to that regime to the total number of days. (vi) Regime persistence. A regime event is defined as the set of consecutive days belonging to the same Fig. 1 Mean pattern of the weather regimes obtained from a combination of ERA40 andERAInterim (1957-2014) regime. Since sometimes a single day pause breaks a long regime event, we relaxed this definition allowing for a single day belonging to a different regime in the set of a regime event. The distribution of the length of the regime events in each simulation is analysed and compared to the reference one. (vii) Regime Jet Latitude distribution. The dynamical connection between WRs and eddy-driven Jet Stream is analyzed here calculating the distribution of the Jet Latitude index corresponding to each regime, in a similar way to what Madonna et al. (2017) did for the reanalysis. The regime-related distributions are then compared to the reference ones. The Jet latitude distributions of the hist-1950 PRIMAVERA simulations are thoroughly analyzed in Athanasiadis et al. (in prep.). Here we use their calculations to analyze the link between Euro-Atlantic WRs and Jet Stream. The calculation considers the daily mean uwind field at 850 hPa, low-pass filtered using a Lanczos filter at 10 days. The jet latitude is defined as the latitude of maximum Jet speed, after zonal averaging between 60W and 0 longitude. The index is consistent with the one defined in Woollings et al. (2010), apart from an additional filtering of grid points with orography higher than 1300 m. (viii) Regime Blocking frequency. We also computed regime composites of the 2D blocking frequencies and compared them to the reference ones in terms of mean frequency over the EAT sector. The Blocking frequencies are calculated as in Schiemann et al. (2017) using the absolute geopotential height (AGP) blocking index (Scherrer et al. 2006). The AGP index is a two-dimensional extension of the Tibaldi and Molteni (1990) blocking index. Two conditions are needed to define a blocking event at a specific grid point: reversal of the climatological equator-pole gradient of the 500-hPa geopotential height to the south and a westerly flow to the north. There is the additional requirement that these conditions are met for at least 5 consecutive days. A thorough analysis of blocking in the hist-1950 PRIMAVERA simulations is performed in Schiemann et al. (2020). Here we are mainly interested in the connection of the blocking index with the WRs.

Model performance and improvements with increased resolution
A common problem in climate studies that compare coupled historical runs with the observations is that only a "single" observed history is available. If the differences between the simulated and observed climates are inside the range of the internal variability, it is very difficult to assess whether they reflect some real bias of the model or only different phases of the internal oscillations. One way to estimate the internal variability on decadal timescales is to perform a bootstrap analysis to the available dataset. This has been done consistently for all simulations and for the observations, considering sets of 30 seasons randomly chosen (with repetitions allowed) among all available seasons between 1957 and 2014. Repeating this step 500 times gives an estimate of the distribution of the various metrics and allows to properly compare models and observations. Figures 2, 3, 4, 5, 6, 7, 8, 10 (with the exception of Fig. 9) show box plots that represent the distributions obtained through the bootstrapping. For each model, the plots show median (horizontal line), first and third quartile (boxes) and 10 and 90 percentiles (bars). For some models, more ensemble runs with the same model configuration were available (see Table 1): in those cases, the distribution median and percentiles are calculated among all members. Mean and extreme (max/min) values across the ensemble of each member's mean metrics are also shown in terms of a dot and two small triangles. At the right of the gray vertical line, three boxes are shown. The first (black box) refers to the ERA 30-year bootstraps and is obtained in the same way as for the models. The other two boxes represent average quantities among all the LR and HR models: in this case the meaning of the boxes and bars does not correspond to the true percentiles and median of the overall distribution, but are calculated as the average of the percentiles and median over all models. For the models with more than two resolutions, only the lowest and highest resolutions are taken into account for the LR and HR averages. The different number of ensemble members does not affect the multi-model average: each model weighs one regardless of the number of simulations performed.
In the following, we dedicate a paragraph to each diagnostic, aiming to rigorously assess the existence and magnitude of model biases.

Regime centroid and mean pattern
Weather Regimes' centroids (i.e. the mean anomalies) are usually poorly simulated in simplified models of the atmospheric circulation or in GCMs at low resolutions (Dawson et al. 2012), at least for some of the regimes. With state-ofthe-art climate model resolutions, the patterns have greatly improved and there is a qualitatively good matching between simulated and observed regimes. More quantitatively, we calculated the distance in phase space between the simulated and reference centroids and the pattern correlation between the respective mean patterns. Figures 2 and 3 show the performance of different models in terms of the phase space distance between simulated and reference centroids (Fig. 2) and in terms of the correlation between the simulated and reference mean patterns (Fig. 3). For each model, the fainter color corresponds to the lowest resolution and the stronger color to the highest resolution version. At the right end of the plot, a measure of the observed variability (black box, named "ERA") is shown along with the ensemble average of the lowest and highest resolution versions of each model. For all regimes but the NAO-, most of the models show an improvement with increased resolution, both in terms of the centroid-to-centroid distance and of the mean pattern correlation. The improvement is not seen for NAO-, where we notice a slight degradation for most models. However, this might in part be due to the fact that the NAO-regime is already well reproduced in most models and shows in fact the smallest mean distance and the largest pattern correlation.

Regime clouds in phase space
The differences between the simulated and observed regime clouds are evaluated through the relative entropy between the two sets of distributions, as explained in Sect. 3. The results of the comparison are shown in Fig. 4. In most cases, the direction of change is consistent with the results obtained comparing the centroids and mean patterns, showing an improvement (i.e. a lower relative entropy) with high-res for most models and all regimes. It is worth recalling here that the pdfs have been corrected for the offset in the centroid position, so that the relative entropy in Fig. 4 is due only to the regime cloud shape and spread.

Variance ratio and sharpness
The variance ratio and sharpness of the regime clusters are the metrics that most clearly show the differences between models and observations. Figures 5 and 6 show the evaluation of these two quantities for a 30-year bootstrap analysis of the models and observations.
As it is clearly seen in Fig. 5, almost all simulations have a lower variance ratio than the reanalysis, and most of them are outside the range of ERA variability. The only notable exception are the LL, MM and HM simulations of the HadGEM-GC31 model, that have values in the lower half of the observed range. This means that the models are not yet able to fully capture the regime structure of the observed geopotential field in the North Atlantic and the simulated regimes tend to be systematically less tightly clustered. Most models improve with the increased resolution, apart from MPI-ESM1-2 and HadGEM-GC31. On average, the HR model versions perform better in that the lower tail of the distribution is significantly shifted towards higher values. A small increase is also noted in the median of the distribution. Also for the sharpness (Fig. 6) the model biases are very clear: all models have a smaller sharpness than the observation and show a much larger range of variability. Although the range of variability might be enhanced by the construction of the sharpness measure, which is highly nonlinear and threshold sensitive, the difference between models and observations remains clear and outside the range of the observed variability. In this case, the models respond in different directions to the increased resolution: some of them get better (AWI-CM-1-1, CNRM-CM6, HadGEM-GC31), other (EC-Earth3P, CMCC-CM2, ECMWF-IFS) do not show significant shifts of the distribution and MPI-ESM1 gets worse. To further complicate this picture, the change is not monotonic with increasing resolution when considering a single model: both ECMWF and HadGEM-GC31 show a different behavior between the MR and LR resolution versions and the HR and MR versions. Similarly to the variance ratio, when looking at the ensemble average of sharpness, a depopulation of the lower quartile of the distribution is seen in favour of the upper quartiles, showing   an overall tendency to go towards the observed value. Strommen et al. (2019) found a consistent improvement of sharpness with increased resolution analyzing a set of AMIP simulations (3 models). Our average result across the LR and HR ensembles goes in the same direction, but the increase is not systematic, with a strong modeldependent footprint.
The projection of models' anomalies onto the reference EOFs-instead of the models' EOFs-(see Sect. 3.1 for details) might have an impact on the simulated regime structure. In particular, a positive (negative) bias can affect the variance ratio and sharpness metrics. This possibility has been tested by checking these two quantities using for each model the phase space spanned by their own EOFs. The results (see Figures S1 and S2 of the supplementary material) are consistent with those shown in Figs. 5 and 6. Although the values of both quantities are smaller when using the models' phase space and significant differences in sharpness can be seen for individual models, the general conclusions reported above remain valid.

Regime frequencies and persistence
The frequency of occurrence of Weather Regimes and their persistence are key quantities representing the dynamics of the system, and of primary importance for impact studies. Figure 7 shows the frequencies for all regimes, as obtained from the 30-yr bootstraps.
As we can see, the distributions of simulated regime frequencies overlap at least partly with the observed one, though some departures from the observations can be seen. For the NAO+ regime, a systematic underestimation of the frequency is found in almost all models, which are unable to catch frequencies in the upper half of the observed distribution. On the other side, most models overestimate the frequency of the AR regime, although the inter-model variability is much larger in this case. The frequency of NAO-and SBL regimes are better caught by the models, at least by the LR and HR averages. No systematic shift of the frequencies of the NAO+ and AR regimes is seen with increased resolution. This is clearly shown by the LR and HR averages that are remarkably stable and have a similar range of variability. A small positive (negative) shift of the frequency distribution is seen on average for the NAO-(SBL) regime, however the different models do not behave consistently in this case. Figure 8 shows the average regime duration in days. For both the NAO regimes, most models systematically underestimate the average regime duration. This is true on average also for the SBL regime, although some models are able to reach the observed range. The same results are obtained if the analysis is performed in terms of regime persistence probability (not shown). It is worth noting that the observed range of variability is significantly larger for the NAO-regime, meaning that there is a large interannual variability on the duration of this regime. The behaviour of the models with increased resolution gives a contrasting result: we found a higher persistence of the NAO+ regime in most models and a lower persistence-in all models but one-for the NAO-regime.

Weather Regimes and jet latitude
The link between the WRs and the peak latitude of the Jet Stream in the North-Atlantic is analyzed here in terms of the Jet latitude distributions corresponding to each regime. As assessed in Madonna et al. (2017), the observed WRs are characterized by specific shifts of the Jet Stream. NAO+, Atlantic Ridge and NAO-see prevailing central, northern and southern jets. The situation is different for Scandinavian Blocking, where the jet is usually tilted from South to North and this produces a much broader distribution of jet latitudes in the 1D framework. Figure 9 shows the results of the same analysis for the models analyzed here. Only the models analyzed in the paper by Athanasiadis et al. (in prep.) are shown. As can be seen, the models usually do quite a good job for NAO+, where the two ensemble averages are barely distinguishable from the observed one. For AR and NAO-, the models are ping is involved. The two ensemble averages are calculated as before, as the average of the median and percentiles over the ensemble still limited in reaching the displacement seen in the real jet: for both regimes, they tend to produce too many central jets. The ensemble averages show in both cases a small improvement with increased resolution. In the case of SBL, a significant bias exists in the spread of the distributions, which tend to be confined to central latitudes for most models and systematically miss the observed northward extension.

Weather Regimes and blocking frequency
This section aims at assessing whether the models are able to reproduce the link between WRs and blocking frequency that we observe in the reanalysis. An overall assessment of the blocking performance of the PRIMAVERA models is discussed in Schiemann et al. (2020). Figure 10 shows the mean blocking frequency over the North-Atlantic sector corresponding to each regime. Numbers indicate the fraction of blocked days per grid point. For the NAO+ and AR regimes, models have a general good agreement with the observations. The model deficiencies are mostly seen for the other two regimes: both for NAO-and SBL the blocking frequency is too low in models. For the SBL, the increased resolution has a positive impact, increasing the blocking frequency in most models. On the other side, no clear effect is seen for NAO-.

Possible drivers of model performance
In the previous section we analyzed the results obtained for each metric both for the individual models and for the two HR and LR ensembles. An obvious question is whether it is possible to attribute the varying model performances to some simple, underlying drivers. To do this, a systematic computation of correlations between model performance and some plausible drivers has been performed. The underlying hypothesis is that the effect of these drivers might have a notable (linear) impact on the metrics, giving a signal beyond the "noise" due to other model-specific features. Concretely, we correlated all the metrics in Sect. 4 with the following quantities: -mean SST in the North-Atlantic; -mean geopotential height at 500 hPa; -low frequency variability of the geopotential at 500 hPa (i.e. 2D standard deviation at periods longer than 5 days); -stationary eddies of the geopotential at 500 hPa (i.e. the time-mean departure from the zonal average); -blocking frequency; -Jet latitude variability (i.e. the difference between the 90th and 10th percentiles of the jet latitude distribution).
To compute the correlations with the atmospheric/oceanic resolution we used the nominal horizontal grid spacing in km from Table 1, so that a negative correlation means that the metric increases with increasing resolution (i.e. with smaller grid spacing). For mean SST, mean geopotential, low frequency variability, stationary eddies and blocking frequency we used the pattern correlation between the observed and simulated pattern inside the EAT sector. The results are shown in Table 2, for three metrics: sharpness, variance ratio and the mean pattern correlation between the reference and simulated regimes. Bold values indicate that the p-value is lower than 0.05. For each driver/metric couple we show two values: the first is calculated for all models, the second excluding the CMCC model, which has a significantly lower number of vertical atmospheric levels (see Table 1). This was done because that model is a consistent outlier for the three metrics considered (see Figs. 3, 5 and 6) and for some of the drivers (mean geopot. field, low fr. var., stat. eddies, blocking frequency), and thus might skew correlations. In fact, computing correlations with and without CMCC produced notably different numbers in some cases. We also tried removing all outlier models separately for each metric (i.e. those lying further than 2 sigma from the multi-model average) and got very similar results.
At a first look, the sharpness is correlated with many of the proposed drivers. However, most of these correlations are spuriously produced by the outlier models and disappear or are strongly reduced when they are removed. After the filtering, there is a residual significant negative correlation of sharpness with the atmospheric grid spacing and a smaller one (non significant) with the oceanic grid spacing. This goes in the direction commented in Sect. 4, although the correlations are not very high, implying that other factors might be at play at the same time. Also, this is in line with the results by Strommen et al. (2019), which observed a systematic increase of sharpness with higher atmospheric resolution.
The agreement with the observed WR patterns shows only small correlations with the proposed drivers. The only consistent correlations before and after the filtering are found for the atmospheric grid spacing (negative, quite small). This is in line with the conclusions in Sect. 4, although the differences between the models are also playing a role here.
The variance ratio shows the most robust and interesting correlations. A small positive correlation with the mean SST pattern in the North-Atlantic becomes significant after the filtering. This suggests that models which have SSTs closer to the observed ones tend to produce more evident and welldefined regimes. Another important driver of high variance ratio is the agreement with the observed mean geopotential field and the pattern of stationary eddies. These correlations are strong both before and after the filtering. The positive correlations with the blocking frequency, low frequency variability and jet variability confirm how tightly related the different perspectives are.

Discussion and conclusions
Among all the metrics analyzed in Sect. 4, many show clear model biases. With "clear" we mean here that we can distinguish between models and observation just looking at some specific metrics. Some of these biases decrease in Table 2 Pearson correlation between all possible drivers considered (rows) and three WR metrics (columns) The value in the left column is calculated for all models, the right one excluding the CMCC model (more comments in text). Significant correlations are shown in bold font the higher resolution versions of the models, while others seem to be insensitive to the model resolution. We will try in this Section to assess the most important model biases and which metrics most benefit from the increased horizontal resolution.
Overall, state-of-the-art coupled climate models are better at reproducing the observed North-Atlantic WRs mean patterns than their predecessors and there is some indication that this is at least partly due to increased model resolution. Dawson et al. (2012) saw a clear improvement when passing from T159 to T1279, in their single model study. Dawson and Palmer (2015) suggested that there is already a substantial improvement at T511 with respect to lower resolutions. In agreement with those studies, we also see an improvement for most models and for all regimes apart the NAO-. On the other side, Strommen et al. (2019) saw no improvement or even a slight degradation in the regimes patterns. However, these works (Dawson et al. 2012;Dawson and Palmer 2015;Strommen et al. 2019) were based on the analysis of AMIP simulations, so the results might not be directly comparable to ours. Also Cattiaux et al. (2013) saw an improvement in the WR patterns, but only between the lowest resolution (96 × 71) and all the others (from 96 × 96 to 192 × 142). For this reason, they claim there is a threshold resolution for such an improvement. The question remains of how far we want to push the goal for the simulated WR patterns. When looking at Figs. 2, 3 and 4 it is clear that the bootstrap estimate of ERA variability is much closer to zero (or 1 in the case of pattern correlation) than the models are. However, the goal for the models might be put somewhat further from the perfect match, taking into consideration that the observed internal variability might be larger than that estimated on the basis of the last 60 years. Nevertheless, some models perform better than the ensemble averages, so there seems to be still room for improvement, although this is probably not going to be obtained through increasing the horizontal resolution only.
As for the regime frequencies, we observed a systematic underestimation of the NAO+ occurrence, with a smaller overestimation of the AR regime. A systematic underestimation in the occurrence of the NAO+ was also seen in the single-model analysis by Cattiaux et al. (2013), although their model had a larger positive bias for NAO-and negative for SBL. No effect on the NAO+ frequency bias is obtained with the increased resolution, but the bias is clear and needs to be tackled in future models. We also see a systematic tendency to underestimate the NAO-and NAO+ persistence, which is in line with previous works (Strommen et al. 2019;Matsueda and Palmer 2018). We see an improvement with HR for NAO+ only and a worsening for NAO-with the increased resolution.
The quantities that most clearly distinguish between models and observation are the variance ratio and sharpness. As assessed in Sect. 4, no model is able to reach the observed value, though some of them get very close to it. For both quantities, we see a depopulation of the lowvalue tail of the models and a reduction in the model bias with increased resolution. However this is not systematic across the ensemble and the effect is model-dependent. A stronger signal was originally expected, in line with what Strommen et al. (2019) saw for three models. Again, the difference might be that the models are run in coupled mode here and not atmosphere-only as in Strommen et al. (2019). As analysed in Sect. 5, there seems to be some influence of the SST bias on the variance ratio, which would then tend to be higher in AMIP runs. However, we do not see an analogous correlation for sharpness, which is also related to the data autocorrelation at one day lag. Also, some role might be played by model tuning: a choice was made in the PRIMAVERA project, not to tune the increased resolution versions of the models. On one side, this assures that no other parameter changed apart from the model resolution; on the other, the increased resolution versions might be less equilibrated than the nominal ones, thus producing a contrasted output. The potentially negative effect of model tuning is expected to be stronger in the case of coupled simulations than for AMIP runs.
The correlation analysis in Sect. 5 gives some additional hints, although the signal across the ensemble may be somewhat suppressed by other model-specific features. The horizontal atmospheric resolution has a significant correlation with sharpness and a small correlation with the WR patterns, that are encouraging if summed to the results of Sect. 4. However, the most interesting metric is probably the variance ratio, which is a good candidate for a synthetic indicator of the model performance in reproducing WRs. Indeed, the variance ratio shows a robust correlation with the low frequency variability and blocking patterns, and also with the jet latitude variability, indicating the strong connection between these perspectives. On the other side, the variance ratio is determined to a certain extent by the model bias in the SSTs and the mean geopotential field over the North-Atlantic. No correlation is found between the variance ratio and the atmospheric resolution, although most models actually improve when increasing the resolution (see Sect. 4.3).
The general picture that appears is that the model performance in reproducing the wintertime Euro-Atlantic WRs tends to improve by increasing the models' horizontal resolution, regarding WR patterns, sharpness and variance ratio. No clear effect is seen on the regime frequency and persistence. However, the model horizontal resolution is clearly not the only key and the real picture is very complicated. Reducing the model biases in the mean geopotential and SST fields might have an effect on the variance ratio and hence on the regime structure. It would be very interesting in this sense to have more insight in the role of model tuning, aside that of increased resolution, and to repeat the present analysis on a tuned high-res version of the models.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.