1 Introduction

Reliable high-resolution climate change projections are highly sought after by the climate impacts community (Giorgi et al. 2009). Despite ongoing investment in the latest generation of global climate models (GCMs) within the Coupled Model Intercomparison Project Phase 6 (CMIP6), the typical resolution of these models (~ 100-km) remains prohibitive for direct use in many climate impacts applications and contexts. At this resolution, GCMs often struggle to adequately represent various features of atmospheric circulation that drive mid-latitude temperature and precipitation extremes, including atmospheric blocking (Davini et al. 2017; Quinting and Vitart 2019; Davini et al., 2020), tropical cyclones (Roberts et al. 2020; Gibson et al. 2023) and short-duration convective storms (Chan et al. 2014; Thomassen et al. 2021). At finer local and regional scales, the biases from coarse resolution GCMs can be larger and more detrimental. This is especially the case in regions of complex terrain, where the interaction between mesoscale circulation features and orography creates fine scale spatial variability poorly resolved by GCMs (Giorgi and Gutowski 2015; Gibson et al. 2023). These features are highly important for defining the regional climate and include orographic precipitation, mountain and valley breezes, lapse rates in the boundary layer, as well as sea and land breezes.

Given the current limitations in the resolution of GCMs, regional climate models (RCMs), which are generally run at much finer resolution (e.g. 12-25-km), are often relied upon to ameliorate these issues. The Coordinated Regional Climate Downscaling Experiment (CORDEX) is a prominent example, designed to coordinate international regional downscaling efforts for climate change impact and adaptation studies (Giorgi and Gutowski 2015). Through CORDEX, several participating groups implement RCMs to downscale selected GCMs for a particular region. The climate projection stream first involves selecting a small number of best-performing GCMs to downscale with a small number of RCMs, where performance is evaluated on a regional basis. Other factors in the GCM selection process include model independence and spread in future warming rates. A balanced selection of models across different institutional ‘families’ and spanning a reasonable range of future warming rates is desirable (Brunner et al. 2020; Hausfather et al. 2022; Grose et al. 2023). The RCM performance and chosen configuration are typically based on evaluating biases across reanalysis-driven RCMs, or different configurations of a single RCM, under “perfect” boundary conditions. Reanalysis-driven simulations are particularly useful for isolating biases that come directly from the RCM, as well as providing a best-case expectation of performance when the RCM is used to downscale GCMs (Di Virgilio et al. 2019).

Building upon CMIP5, several studies have evaluated the added value of RCM downscaling in various regions. In this context, added value refers to the reduction in biases in the RCM relative to the host GCM at the regional scale (Giorgi and Gutowski 2015). The concept was extended by Di Virgilio et al. (2020) to include ‘realised added value’ where added value from the RCM refers to both a reduction in biases and a modification of spatial details in the climate change signal. Several studies have reported added value from RCMs, especially for representing precipitation variability and extremes in various regions (Lloyd et al. 2021; and references therein). Across Australia, Di Luca et al. (2016) and subsequently Di Virgilio et al. (2020) found evidence for added value in both temperature and precipitation mean and extreme fields, though this varied across regions and seasons. Even in cases of added value, important systematic biases in RCM output can remain which require careful evaluation. Over parts of Europe, Knist et al. (2017) found a tendency for RCMs to overestimate land-atmosphere coupling strength due to issues in the land-surface models, with implications for the representation of heatwaves. Under ‘perfect’ boundary conditions (i.e. reanalysis-driven), Di Virgilio et al. (2019) noted a tendency in some RCMs to underestimate maximum temperature over large parts of Australia by more than 5 °C, while Hirsch et al. (2019) noted a tendency to underestimate heatwave frequency and intensity in the same RCM ensemble. For precipitation extremes across the United States, Gibson et al. (2019) noted cases of RCM added value through increasing precipitation rates that were otherwise too small in most GCMs, though the highest resolution RCMs evaluated (12-km) began to overestimate extreme events. Other studies have emphasized ongoing systematic biases with the large-scale atmospheric circulation features in GCMs which are inherited by RCMs and degrade the downscaled output (Lloyd et al. 2021), though this issue can be at least partly addressed through the GCM selection process. Due to these regional and seasonal variations in RCM performance, a comprehensive evaluation of added value is important for setting expectations regarding the use of the output from RCMs in climate impact and adaptation applications.

New Zealand provides an ideal case study for evaluating added value from RCMs. Situated within the prevailing Southern Hemisphere storm track, New Zealand is exposed to a diverse range of synoptic conditions that drive extreme events. Especially notable in summer, ex-tropical cyclones (Lorrey et al. 2014) and atmospheric rivers (Prince et al. 2021) tap into sub-tropical sources of moisture to drive extreme rainfall. In winter, a diverse range of fronts, cyclones, and marine polar air masses reach New Zealand relatively unimpeded (Sturman and Tapper 1996). At smaller scales, the highly complex mountainous and coastal terrain gives rise to important local and mesoscale circulation features (Sturman et al. 1999). Due to the uplift from the Southern Alps, orographic precipitation in western regions results in some of the largest annual rainfall totals on Earth (Ibbitt et al. 2000). To be fully successful, regional downscaling must capture the full range of these processes and their relevant interactions with complex terrain at high spatial resolution (Drost et al. 2007; Rampal et al. 2022a).

Compared to other regions, the literature evaluating dynamical RCMs and added value for New Zealand is relatively sparse. Early work, prior to the availability of CMIP3 output, focused on applying a relatively coarse RCM (50-km) developed by CSIRO to output from a single GCM (Renwick et al. 1998, 1999). This included a simple climatological evaluation as well as describing the regional transient climate response to doubling of CO2. Also prior to CMIP3, Drost et al. (2007) evaluated the HadRM3H regional model (40-km) embedded within the global model HadAM3H, which is derived from the atmosphere component of the coupled model HadCM3. The authors show several improvements at high resolution, despite a tendency for simulated temperatures to be too low and the precipitation totals to be too high in high elevation regions. Around the time of CMIP3, Ackerley et al. (2012) evaluated an updated version of the same regional model at higher resolution, HadRM3P (~ 30 km) driven by both reanalysis and GCM data. The authors describe general improvements at this higher resolution, including for the west-east gradients of precipitation around the Southern Alps but with an overall negative bias in climatological precipitation totals. There was also a general tendency for a negative bias in maximum air temperatures and a positive bias in minimum air temperatures (i.e. overall low bias in daily temperature range).

Based on the results from Ackerley et al. (2012), the CMIP5 downscaling for New Zealand was carried out using bias corrected SST and sea-ice concentration (SIC) fields from 6 selected CMIP5 GCMs to drive the global HadAM3P model (~ 150-km), which was then downscaled over the New Zealand domain in a second step with HadRM3P (~ 30-km) (Ministry for the Environment 2018). These CMIP5 downscaled simulations have been used extensively in various applications in New Zealand, including for catchment-scale (Jobst et al. 2018; Akhter et al. 2019) and national-scale hydrology (Collins 2020), fire weather (Melia et al. 2022), and damages from extreme events (Pastor-Paz et al. 2020), among others. While general climatological biases have been evaluated, no comprehensive assessment of added value has been carried out for RCMs in the region, including for extreme events. More recently, Gibson et al. (2023) applied and evaluated the global non-hydrostatic Conformal Cubic Atmospheric Model (CCAM) with a stretched grid configuration producing high-resolution (~ 12 km) simulations over New Zealand. The experiment design there focused on internal variability over the historical period, generating a 10-member initial condition ensemble from observed SST/SIC alone. Encouraging results were obtained over the New Zealand region in terms of precipitation and temperature-based extreme indices, which motivated the wider application and evaluation of CCAM for CMIP6 dynamical downscaling in the present study.

The present study introduces the experiment design for the updated CMIP6 dynamical downscaling; this project will provide updated and improved projections of future climate over New Zealand (Sect. 2, Methods). The primary focus of this study is then to comprehensively evaluate and quantify the historical biases and added value from the newly produced downscaled GCM/RCMs (Sect. 3, Results and Discussion). By comparing reanalysis-driven RCM biases with GCM-driven RCM biases, we gain insight into the origins of various biases. A comprehensive evaluation of extreme events is presented, including various temperature and precipitation-based indices and tropical cyclone frequency and intensity.

2 Methods

2.1 Overview of experiment design

The primary model used here for climate downscaling is the Conformal Cubic Atmospheric Model (CCAM) (version CCAM-2206) developed by the Commonwealth Scientific and Industrial Research Organisation (CSIRO) (Table 1). Further details about the specific CCAM configuration used are given in Sect. 2.2. Two other RCMs (WRF and UM) are also included in Table 1 as part of the broader downscaling project for New Zealand, namely, to compare biases across RCMs when driven by reanalyses and to compare the climate change signal between RCMs for select GCMs. Here we focus primarily on the historical downscaling and added value from CCAM, while this other reduced set of inter-RCM comparisons will be presented in a separate paper.

Table 1 Details for the downscaling experimental design, including the host models (GCM/reanalysis) and RCMs. Each RCM has an atmospheric model grid of approximately 12 km resolution over the New Zealand region. Models with asterisks (i.e. WRF and UM) are not analysed in this study but are included in the table to show data availability for the wider national climate downscaling effort. Further details of the model setup are given in Sect. 2. For ACCESS-CM2, r4i1p1f1 (instead of r1i1p1f1) was used due to data availability

Traditionally, most RCMs receive lateral boundary conditions from the host model (GCM or reanalysis) for a limited area domain, with the RCM evolving freely across the inner domain. Here, CCAM is instead run as a global atmospheric model with global spectral nudging across a stretched grid configuration. Spectral nudging (sometimes referred to as scale-selective downscaling) removes the need for pre-defined lateral boundaries and instead uses a cutoff in the spectral domain at a particular length scale (Thatcher and McGregor 2009). From this, the state of the regional atmosphere at large length scales is determined by the host model, while smaller scales (e.g. mesoscale circulation features) are allowed to evolve freely. The grid configuration in CCAM allows the placement of a high-resolution face centred over the domain of interest with gradually reduced resolution away from this region. Compared to limited area RCMs that are exposed to a step change in resolution at the domain boundary, this more seamless grid configuration allows coupling between the global and regional spatial scales on the same grid and may provide benefits for the representation of storms as they approach the domain of interest (Gibson et al. 2023). This approach can also alleviate important long-standing issues concerning the size and placement of the inner domain in limited area RCMs (Davies 1976; Jones et al. 1995).

Since CCAM is a global model, climate downscaling can either be performed through direct spectral nudging (Thatcher and McGregor 2009) to atmospheric fields from the host model (e.g. Grose et al. 2023) or through an AMIP-style simulation driven only by SST/SIC at the lower boundary (e.g. Hoffmann et al. 2016; Di Virgilio et al. 2020). Each approach (‘spectral nudging’ and ‘SST/SIC driven’) is known to have strengths and weaknesses. The spectral nudging approach is more consistent with standard climate downscaling performed through CORDEX and ensures relatively close consistency with the direct atmospheric fields from the GCM. The degree of consistency with atmospheric GCM fields can be further enhanced or relaxed through nudging parameters allowing smaller scale features to evolve at high resolution more freely. Alternatively, in the SST/SIC-driven approach, atmospheric conditions are simulated by CCAM itself as a global atmospheric model, while the rate of warming is largely determined by the SSTs from the host GCM. This approach offers an opportunity to reduce biases in the SST fields prior to the downscaling. For example, Hoffmann et al. (2016) found that bias correction of input SST fields prior to downscaling with CCAM improved the downscaled tropical precipitation climatology and the response to ENSO. Similarly, Di Virgilio et al. (2020) found that bias corrected SSTs in the output of the host GCM helped reduce CCAM biases in downscaled precipitation over Australia. Since New Zealand is an island nation where the surface air temperature over land is heavily influenced by regional SSTs (Gibson et al. 2023), bias correction of this nature can be particularly beneficial.

The output from each approach can be combined and considered part of an ensemble of regional climate projections (Grose et al. 2023). In this study, using CCAM, three GCMs are downscaled through direct spectral nudging to atmospheric conditions and another three GCMs are downscaled through bias-corrected SST/SIC-driven simulations (see Table 1 for details). The use of both approaches has also helped address important data availability issues from the CMIP6 output described in Sect. 2.3.

2.2 CCAM configuration details

CCAM-2206 was used in all simulations (Table 1) as a global atmospheric model with a stretched grid configuration (C288) and a Schmidt stretching factor of ~ 0.343. The Schmidt transformation determines the degree of stretching away from the region of highest resolution: this provided high resolution over the target NZ domain (12 km) and relatively high resolution over the wider South Pacific region (~ 12-35-km) (see Fig. 1). This grid configuration was shown to have several advantages for the New Zealand region (Gibson et al. 2023). CCAM adopts a non-hydrostatic, semi-Lagrangian dynamical core with a range of physical parameterizations. The CCAM radiation parameterizations (Freidenreich and Ramaswamy 1999; Schwarzkopf and Ramaswamy 1999) are based on the GFDL-CM3 radiation code with recent updates for CMIP6 radiative forcings. The cloud microphysics are based on a single moment prognostic cloud condensate scheme from the CSIRO-Mk3.6 model (Rotstayn 1997). Prognostic (single moment) aerosols were switched on and based on the CSIRO-Mk3.6 aerosol scheme with modifications for coupling to CCAM physical parameterizations. Turbulent mixing in the atmosphere is based on the k-ε turbulence closure scheme following Hurley (2007). The convection scheme in CCAM implements a mass-flux closure scheme and includes downdrafts, entrainment, and detrainment (McGregor 2003). The land surface in CCAM adopts the CABLE land surface model (Kowalczyk et al. 2006).
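As a rough consistency check on the quoted grid spacing, the sketch below applies a rule-of-thumb relation that is an assumption here, not something stated in the text: a cubic grid with N cells along each panel edge has roughly circumference / (4 N) spacing when unstretched, and the Schmidt factor scales the spacing at the centre of the high-resolution panel. The function name and the constant are illustrative only.

```python
EARTH_CIRCUMFERENCE_KM = 40_075.0  # equatorial circumference, used as an approximation


def cubic_grid_spacing_km(cells_per_panel_edge: int, schmidt: float = 1.0) -> float:
    """Approximate grid spacing at the centre of the focused panel.

    Assumption: four panels span a great circle, so an unstretched C-N grid
    has about circumference / (4 * N) km per cell, and the Schmidt factor
    multiplies that spacing on the high-resolution panel.
    """
    uniform_spacing = EARTH_CIRCUMFERENCE_KM / (4 * cells_per_panel_edge)
    return uniform_spacing * schmidt


unstretched = cubic_grid_spacing_km(288)
stretched = cubic_grid_spacing_km(288, schmidt=0.343)
print(f"C288 unstretched: {unstretched:.1f} km; with Schmidt 0.343: {stretched:.1f} km")
```

Under these assumptions the C288 grid with a Schmidt factor of 0.343 gives a spacing close to the ~ 12 km quoted for the New Zealand domain.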

Fig. 1

Global variable resolution (i.e. stretched) grid used in all CCAM simulations. The high-resolution grid is centred over New Zealand (~ 12 km) while still providing relatively high resolution over much of the South Pacific (12–35 km). Note that the colour bar (units: km) is non-linear (panel a). Orography over New Zealand in CCAM (units: m a.s.l) (panel b). Figure 1a is adapted with permission from Gibson et al. (2023)

CCAM was run with 35 vertical levels in the atmosphere and a 4-minute dynamical timestep. For both the reanalysis and GCM fields, atmospheric spectral nudging was applied to 6-hourly surface pressure, winds and air temperature for levels between 850-hPa and 10-hPa. No nudging to water vapour was performed when downscaling either the reanalysis or the GCMs. The spectral filter nudging length (Thatcher and McGregor 2009) was set to a length scale of 3000-km. The reanalysis-driven run used atmospheric fields from ERA5 reanalysis (Hersbach et al. 2020) with daily SST/SIC fields from the Operational Sea Surface Temperature and Sea Ice Analysis (OSTIA, Donlon et al. 2012). Land use/land cover change was switched off.
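The idea of a scale-selective cutoff can be illustrated with a deliberately simplified one-dimensional sketch. It is not CCAM's filter, which operates on the stretched global grid and relaxes the model towards the host rather than replacing fields outright. Here a periodic transect is decomposed with a plain discrete Fourier transform, wavelengths at or above the cutoff are taken from the host, and shorter wavelengths are kept from the regional model. The function names, the hard replacement, and the transect geometry are all assumptions made for illustration.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(values)
    return [sum(values[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def inverse_dft_real(coeffs):
    """Inverse transform, returning the real part of each sample."""
    n = len(coeffs)
    return [(sum(coeffs[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n).real
            for j in range(n)]


def nudge_large_scales(model, host, dx_km, cutoff_km=3000.0):
    """Take wavelengths >= cutoff_km from `host` and shorter ones from `model`."""
    n = len(model)
    domain_km = n * dx_km
    model_hat, host_hat = dft(model), dft(host)
    blended = []
    for k in range(n):
        wavenumber = min(k, n - k)  # fold the negative-frequency half
        is_large_scale = wavenumber == 0 or domain_km / wavenumber >= cutoff_km
        blended.append(host_hat[k] if is_large_scale else model_hat[k])
    return inverse_dft_real(blended)
```

With a 3000-km cutoff, a 20,000-km wave in the host would override the model's version of that wave, while a 2000-km feature generated by the model would pass through unchanged.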

The historical simulation period for downscaling each GCM spanned years 1960–2014, with an additional 1-year spin-up (year 1959). Historical greenhouse gas, aerosol, ozone, and solar forcings from the CMIP6 historical experiment were used. The reanalysis-driven simulation period spanned years 1982–2020 with 1-year spin-up (year 1981). The future simulation period for each GCM spanned years 2015–2099 across a range of future scenarios. Since the focus of this study is on historical evaluation, further details of these future projections will be provided in a separate paper. Each CCAM simulation was run independently on the Cray Kupe supercomputer at NIWA each using 22 XC50 compute nodes (880 physical CPU cores) generating approximately 3.5 simulation years per wall-clock day.

2.3 Selection of CMIP6 GCMs for downscaling

The computational cost of running RCMs typically dictates that a relatively small number of GCM/RCM pairs are used in downscaling. As is common in other downscaling projects (e.g. Evans et al. 2014; Jacob et al. 2020; Di Virgilio et al. 2022; Grose et al. 2023), the choice of which GCMs to downscale was made by balancing: (1) the regional performance of the GCM over the historical period; (2) model independence; (3) the rate of future warming. As mentioned above, a further practical consideration is data availability required to run the RCM, since several CMIP6 modelling groups did not provide the necessary sub-daily output.

The initial performance evaluation of CMIP6 GCMs was carried out following the approach for selecting CMIP5 GCMs in the previous downscaling exercise for New Zealand (Mullan and Dean 2009; Ministry for the Environment 2018) with updates for CMIP6. Model rankings were based on a comprehensive 62-element evaluation of the historical climate over the New Zealand region (120–210°E, 20–60°S) and wider South-West Pacific (90–240°E, 10°N–60°S), relative to ERA-Interim reanalysis. The four categories of indicators of climate and circulation features are:

  1) The annual mean, seasonal cycle and the interannual standard deviation for mean sea level pressure, surface air temperature and precipitation. The Pearson pattern correlation and RMSE metrics were computed for the two domains. Combined this gives 36 elements.

  2) The correlation between the Southern Oscillation Index (SOI) and mean sea level pressure, surface air temperature and precipitation. This is computed across two domains giving 6 elements.

  3) Annual cycle in climatological mean sea level pressure differences used to diagnose regional circulation indices Z1, Z2, M1 (Trenberth 1976), and the SOI, giving 4 elements.

  4) The position and the intensity of the winter and summer southern hemisphere zonal wind maxima and high-pressure belt maxima, computed across the two domains. Combined this gives 16 elements.

The ranking algorithm weights each of the above four categories to ensure relatively consistent performance across all aspects in the model selection. The ranking was applied to the historical period (years 1979–2014) for all available CMIP6 models and ensemble members (over 60 GCMs and 477 members in total). The intention of using multiple ensemble members was not to select the best performing member (for a given model) but rather to examine the sensitivity of the model selection to internal variability.

For the CCAM simulations with atmospheric nudging (detailed in Table 1), this complete set of indicators, across all four categories, was used in the ranking and GCM selection. The final ranking for individual models is presented in Supplementary Material Figure S1, with the models selected for downscaling highlighted (EC-Earth3, ACCESS-CM2, NorESM2-MM). As shown, the selected models are among the top performing CMIP6 models, and differences in the overall score between these selected models are relatively small after accounting for spread across ensemble members. Note that we did not simply select the three single highest scoring CMIP6 models due to important issues around data availability and model independence. In terms of availability, despite ranking well, TaiESM1 and GFDL-ESM4 did not provide all required 6-hourly fields for downscaling so could not be used. In terms of model independence, different models from the same institution often performed similarly well in the rankings so only the top performing model was selected from these (e.g. the different configurations of the EC-Earth3 model).

For the CCAM SST/SIC-driven downscaling runs, since these are not nudged to the host model atmospheric conditions, the atmospheric fields from the host model are less relevant for evaluation. Instead, the primary consideration for model selection here was the surface air temperature features (i.e. annual mean, seasonal cycle, and interannual standard deviation). The ranking and selection of models based on this consideration is shown in Supplementary Material Figure S2. For this selection, data availability is not a limiting factor since all models have the required fields to drive CCAM. This enabled a more balanced consideration of diversity in the equilibrium climate sensitivity (ECS) and model independence alongside model performance. The two highest performing models were first selected (AWI-CM-1-1-MR and CNRM-CM6-1) followed by GFDL-ESM4. The selection of GFDL-ESM4 was made from the consideration of including an additional relatively low ECS model in the downscaled ensemble, while being relatively independent from the other top performing models (described below).

The ECS of the 6 models selected for downscaling are shown alongside those of the CMIP6 ensemble in Fig. 2. The 6 selected models span the overall CMIP6 model ECS range well, while additionally being constrained to the IPCC ‘very likely range’ for ECS (between 2 and 5 °C). Our selection has ensured that both the lower and higher end of the CMIP6 ECS ‘very likely’ range are included. This resulted in the exclusion of certain ‘hot models’ that fell outside this range (Hausfather et al. 2022), which otherwise performed reasonably well in the historical evaluation (e.g. UKESM1-0-LL). The mean ECS from the six downscaled models is similar to, albeit slightly lower than, the CMIP6 mean ECS (3.69 °C versus 3.77 °C, respectively). While ECS describes stabilized warming at equilibrium in a global mean sense, these 6 selected models also sample well the range of CMIP6 warming rates when assessed in terms of transient warming over the New Zealand region (Supplementary Material Figure S3).

Fig. 2

Spread of ECS across CMIP6 models (grey bars) alongside those selected for downscaling with CCAM (blue bars). The grey circle is the ECS mean of all available CMIP6 models (n = 52), the blue circle is the ECS mean of the downscaled CMIP6 models (n = 6). The black line represents the IPCC AR6 ‘likely range’ (ECS between 2.5 and 4 °C) and the red line represents the ‘very likely range’ (ECS between 2 and 5 °C). Models with asterisks indicate where CCAM is driven by the SST/SIC fields from the host GCM without atmospheric nudging. ECS methodology and values are from Zelinka et al. (2020)

Adopting the concept of ‘families’ of models within CMIP (Knutti et al. 2013; Abramowitz et al. 2019; Brunner et al. 2020), model independence was assessed qualitatively based on an a priori knowledge of each model. This includes obvious institutional dependencies (e.g. the different variants of the EC-Earth3 model) as well as known sharing of major model components across models from different institutions (e.g. ACCESS-CM2 and UKESM1-0-LL models share the same underlying atmospheric model). As described in the model independence framework of Brunner et al. (2020) each of the 6 selected GCMs can be considered belonging to an overall separate ‘family’. While all 6 GCMs have notably different atmospheric models, in certain cases their ocean models have some dependencies. Namely, ACCESS-CM2 and GFDL-ESM4 implement different versions of the Modular Ocean Model (MOM) while CNRM-CM6-1 and EC-Earth3 both implement version 3.6 of the Nucleus for European Modelling of the Ocean (NEMO). In the context of how these GCMs are downscaled here, the importance of this ocean model dependency is somewhat reduced since bias correction of SST fields is first performed on GFDL-ESM4 and CNRM-CM6-1 models within the SST/SIC-driven approach to downscaling (see Table 1). Furthermore, although not explicitly considered as part of the selection process, the six models chosen here also span a wide range of the larger CMIP6 ensemble in terms of temperature and circulation climatological fields between CMIP6 models as quantified in Brunner et al. (2020).

2.4 Bias adjustment of SST/SIC-driven runs

As detailed above (Table 1), when CCAM is driven only from the SST/SIC fields of the GCM (i.e. for downscaling AWI-CM-1-1-MR, CNRM-CM6-1, and GFDL-ESM4) bias correction of the input SST and SIC fields was first performed.

2.4.1 SST adjustment

The reduction of systematic biases in prescribed SST and sea-air fluxes of heat and moisture can improve the representation of large-scale circulation and precipitation in atmospheric models (Hoffmann et al. 2016; Nguyen et al. 2012; Ashfaq et al. 2011; Chapman et al. 2023); this can also reduce the inter-model spread of future climate projections (He and Soden 2016). We reduced biases in the climatological mean and interannual variance of monthly SST from CMIP6 models, prescribed to CCAM, using a method similar to Hoffmann et al. (2016), hereafter H2016. The adjustments for each CMIP6 model were derived with reference to SST observed from 1982 to 2014 using OSTIA data. OSTIA compares well to other global observational SST data sets (Yang et al. 2021) and has been used as surface boundary forcing in recent atmospheric reanalyses over Australasia (Su et al. 2021) and a free-running CCAM historical ensemble targeting the southwest Pacific region (Gibson et al. 2023). While bias-correcting the CMIP6 SST, several steps were taken to avoid over-adjusting the variance and to preserve the long-term trends from the models, similar to H2016. In particular, linear trends and non-linear multidecadal variations were temporarily removed, the SST variance was then adjusted by a correction factor, and the linear trends and multidecadal variations were subsequently restored. This variance correction factor was derived from the ratio of the standard deviations of observed and modelled SST, and was limited to the range 0.1–3.0 to avoid making very large adjustments to the model SST variability. In certain cases, the correction of SST variability at high latitudes may be impaired by mismatches between modelled and observed sea ice cover. Therefore, and similar to H2016, the variance correction was linearly relaxed with latitude, from full adjustment at the equator to zero adjustment poleward of 50°N and 60°S. Thus the variance correction was extended slightly further poleward in the Southern Hemisphere, where sea ice is climatologically concentrated slightly further poleward, and this allowed additional correction of SST variability across NZ latitudes (~ 34–47°S).
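The sequence of steps above can be sketched for a single grid cell as follows. This is a minimal illustration under stated assumptions rather than the procedure actually used: only a linear trend is removed (the multidecadal component is ignored), the model and observed series are assumed to cover the same reference period, and the latitude relaxation is assumed to blend the correction factor linearly towards 1. All function names are hypothetical.

```python
from statistics import mean, pstdev


def linear_trend(series):
    """Least-squares straight line through a series sampled at t = 0, 1, ..., n-1."""
    n = len(series)
    t_bar, y_bar = (n - 1) / 2, mean(series)
    slope = (sum((t - t_bar) * (y - y_bar) for t, y in enumerate(series))
             / sum((t - t_bar) ** 2 for t in range(n)))
    return [y_bar + slope * (t - t_bar) for t in range(n)]


def latitude_weight(lat):
    """1 at the equator, falling linearly to 0 at 50N or 60S (assumed form)."""
    limit = 50.0 if lat >= 0 else 60.0
    return max(0.0, 1.0 - abs(lat) / limit)


def adjust_sst(model, obs, lat, lower=0.1, upper=3.0):
    """Correct the mean and detrended variance of `model` towards `obs`."""
    model_trend = linear_trend(model)
    model_anom = [y - t for y, t in zip(model, model_trend)]
    obs_anom = [y - t for y, t in zip(obs, linear_trend(obs))]

    # Ratio of observed to modelled standard deviation, limited to [lower, upper].
    ratio = min(upper, max(lower, pstdev(obs_anom) / pstdev(model_anom)))
    # Relax the correction towards "no change" (factor 1) away from the equator.
    factor = 1.0 + latitude_weight(lat) * (ratio - 1.0)

    mean_bias = mean(model) - mean(obs)
    # Restore the model's own trend after scaling the anomalies.
    return [t - mean_bias + factor * a for t, a in zip(model_trend, model_anom)]
```

In this sketch a cell at 40°S would receive one third of the full variance correction, whereas with a symmetric 50° limit it would receive one fifth, which illustrates why extending the relaxation to 60°S matters at New Zealand latitudes.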

2.4.2 SIC adjustment

In addition to SST, CCAM also requires global prescribed sea ice concentrations (SIC). We cannot take the SIC directly from the CMIP6 models because it would be physically inconsistent with the bias-adjusted SSTs (described above) and because there can be large biases in SIC in the CMIP6 models. Instead, we use the statistical algorithm based on Stone and Pall (2021), hereafter SP2021, to estimate the SIC field directly from the bias-adjusted SST field, described above. Whereas SP2021 used the algorithm to estimate perturbations in an SIC field provided perturbations in an SST field, here we use it as a direct predictor of SIC. The first step involves calculating a relationship between SST and SIC based on observed values in OSTIA during the 2001–2010 period, separately for each hemisphere. We take monthly mean values at all grid cells within the hemisphere, select all cells with SIC in one of 100 evenly spaced bins (from no-ice to full-ice), and take the median SST value. In contrast to SP2021, here we perform the calculation separately for each calendar month, and instead of fitting a linear function to the median SSTs from each of 100 SIC bins we smooth the relationship in the 100 SIC bins into a monotonic function of SST. This function can then take any SST value from any grid cell from a GCM and translate it to a unique SIC value. The resultant Southern Hemisphere SIC tends to have the same mean over the 1982–2014 period but with more interannual variability, while future trends in SIC are preserved.
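A minimal sketch of this kind of SST-to-SIC mapping is given below for one hemisphere and one calendar month. The binning and the median follow the description above, but the smoothing step is replaced by a simpler stand-in: bins whose median SST breaks monotonicity are dropped and the remaining points are linearly interpolated. That substitution, the function names, and the clamping at both ends are assumptions for illustration; SIC is taken as a fraction between 0 and 1.

```python
from bisect import bisect_left
from statistics import median


def fit_sst_to_sic(sst_values, sic_values, n_bins=100):
    """Return (sst_nodes, sic_nodes) with SST increasing and SIC decreasing."""
    bins = [[] for _ in range(n_bins)]
    for sst, sic in zip(sst_values, sic_values):
        index = min(int(sic * n_bins), n_bins - 1)  # full ice goes in the top bin
        bins[index].append(sst)

    # Median SST for each occupied SIC bin, labelled by the bin centre.
    medians = [((i + 0.5) / n_bins, median(b)) for i, b in enumerate(bins) if b]

    # Walk from full ice to no ice, keeping only strictly warmer medians.
    sst_nodes, sic_nodes = [], []
    for sic_centre, sst_median in reversed(medians):
        if not sst_nodes or sst_median > sst_nodes[-1]:
            sst_nodes.append(sst_median)
            sic_nodes.append(sic_centre)
    return sst_nodes, sic_nodes


def predict_sic(sst, sst_nodes, sic_nodes):
    """Map one SST value to a unique SIC value by linear interpolation."""
    if sst <= sst_nodes[0]:
        return sic_nodes[0]
    if sst >= sst_nodes[-1]:
        return sic_nodes[-1]
    j = bisect_left(sst_nodes, sst)
    weight = (sst - sst_nodes[j - 1]) / (sst_nodes[j] - sst_nodes[j - 1])
    return sic_nodes[j - 1] + weight * (sic_nodes[j] - sic_nodes[j - 1])
```

In practice one such fit would be made for each hemisphere and calendar month from the observed fields, and then applied cell by cell to the bias-adjusted GCM SST.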

2.5 Added value metrics

In the context of RCM evaluation, added value refers to the reduction in biases in RCM output relative to the host GCM output at the regional scale. This is highly relevant since one of the main motivations for performing downscaling is to reduce important GCM biases at these finer scales. Here added value is quantified through various statistics, each applied to climatological (i.e. annually or seasonally time-averaged) fields over New Zealand land-based grid cells. These are: root-mean-square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and pattern correlation. For each of these metrics, the percentage difference between the RCM and GCM error was computed, where positive values indicate a reduction in the error from the RCM (i.e. added value).

Analysing a range of added value metrics is beneficial since they account for different aspects of model performance. RMSE and MAE both penalize heavily for the magnitude of the errors while MAPE penalizes based on the magnitude of percentage errors, the latter useful for characterizing errors across regions of relatively low climatological precipitation. In contrast, pattern correlation penalizes based on differences in the spatial distribution of the climatology. Following Di Virgilio et al. (2020), a final added value metric was included to account for the overall spatial coverage of added value, which we refer to as Land%. This is defined as the fraction of all grid cells where the RCM shows added value defined by MAE, scaled to range between −50 and +50. As such, a Land% value > 0 indicates that the RCM added value covers more than half of the total land area, while a value of 50 indicates that the RCM added value covers all land area. Climatological fields of precipitation, daily maximum air temperature (tasmax), and daily minimum air temperature (tasmin) were assessed for added value for each GCM/RCM, along with a seasonal breakdown (annual, summer, winter). Added value was then extended to extreme events in each GCM/RCM by evaluating select temperature and precipitation-based indices (Table 2) from the Expert Team on Climate Change Detection and Indices (ETCCDI, Zhang et al. 2011). This enabled different parts of the distribution to be evaluated as well as duration-based considerations (e.g. the length of wet and dry spells).
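The error-based metrics and Land% can be sketched as below for flattened lists of land grid-cell climatologies on a common grid. The exact normalisation is not spelled out in the text, so two choices here are assumptions: the percentage difference is taken relative to the GCM error, and Land% is computed as the percentage of cells with a smaller absolute RCM error, minus 50. Pattern correlation is left out because its sign convention (higher is better) is not stated. Function names are illustrative.

```python
from math import sqrt
from statistics import mean


def rmse(model, reference):
    return sqrt(mean((m - r) ** 2 for m, r in zip(model, reference)))


def mae(model, reference):
    return mean(abs(m - r) for m, r in zip(model, reference))


def mape(model, reference):
    """Mean absolute percentage error; reference values must be non-zero."""
    return 100.0 * mean(abs(m - r) / abs(r) for m, r in zip(model, reference))


def added_value_pct(gcm_error, rcm_error):
    """Positive when the RCM error is smaller than the GCM error (assumed form)."""
    return 100.0 * (gcm_error - rcm_error) / gcm_error


def land_pct(gcm, rcm, reference):
    """Share of cells where the RCM absolute error is smaller, shifted to [-50, 50]."""
    improved = sum(abs(r - o) < abs(g - o) for g, r, o in zip(gcm, rcm, reference))
    return 100.0 * improved / len(reference) - 50.0
```

For example, with reference values [10, 20, 30, 40], a GCM that is 4 units too high everywhere, and an RCM that is 1 unit too high in three cells but 6 units too high in the fourth, the MAE-based added value is positive and Land% is 25, even though the RCM is worse in one cell.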

Table 2 Extreme event ETCCDI indices evaluated across models in this study

The reference dataset for assessing added value was from the daily Virtual Climate Station Network (VCSN) (~ 5-km grid, Tait et al. 2006; Tait et al. 2012; Tait and Macara 2014). The gridded daily VCSN product is constructed based on spatial interpolation of an extensive network of station data across New Zealand. The interpolation uses a second-order trivariate thin-plate smoothing spline. Additional information on location and climatological fields are used in the interpolation. The updated ‘Norton adjusted’ version of VCSN was used for tasmax and tasmin which improves temperature estimates in high elevation regions (Tait and Macara 2014). The updated ‘augmented’ version of VCSN was used for precipitation which includes a much larger number of stations in the final interpolated product (approximately 1200), as discussed in Tait et al. (2012). For assessing added value across the common overlapping years of 1982–2014, all products (i.e. VCSN, reanalysis and models) were regridded to the 12-km CCAM grid over New Zealand using conservative regridding. Regridding was performed after the computation of extreme indices (e.g. Gibson et al. 2019).
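The idea behind conservative regridding can be shown with a one-dimensional sketch, in which each target cell receives the overlap-weighted mean of the source cells it covers. This is a simplified illustration under stated assumptions: the function name and the 5-km and 12-km cell widths along a single axis are hypothetical, whereas the regridding used in practice operates on two-dimensional cells on the sphere.

```python
def conservative_regrid_1d(src_edges, src_values, dst_edges):
    """Overlap-weighted averaging of source cell means onto target cells."""
    dst_values = []
    for j in range(len(dst_edges) - 1):
        lo, hi = dst_edges[j], dst_edges[j + 1]
        total = 0.0    # integral of the source field over the target cell
        covered = 0.0  # length of the target cell covered by source cells
        for i, value in enumerate(src_values):
            overlap = min(hi, src_edges[i + 1]) - max(lo, src_edges[i])
            if overlap > 0:
                total += overlap * value
                covered += overlap
        dst_values.append(total / covered if covered > 0 else float("nan"))
    return dst_values

# Twelve 5-km source cells mapped onto five 12-km target cells (60 km in total).
src_edges = [5.0 * i for i in range(13)]
fine = [float(v) for v in range(1, 13)]
dst_edges = [12.0 * j for j in range(6)]
coarse = conservative_regrid_1d(src_edges, fine, dst_edges)
```

Because the weights are the overlap lengths, the domain integral of the field is the same before and after regridding, which is the property that makes this approach attractive for quantities such as precipitation.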

2.6 Circulation regimes

Circulation regimes were analysed to evaluate relationships between a range of synoptic circulation conditions and precipitation variability in the RCM output. This adds additional understanding of model biases on top of the climatological added value evaluation. Circulation regimes were defined following the approach of Rampal et al. (2022b) applied to ERA5 daily 1000-hPa geopotential heights (z1000), where the seasonal cycle and trend were first removed. In this approach affinity propagation is first applied to determine an ‘optimal’ number of clusters (n = 9), from which K-means clustering is applied to the first three empirical orthogonal functions (EOFs) computed over the New Zealand domain.

In this approach, the selection of 9 clusters was recommended by Rampal et al. (2022b), which highlighted that fewer than the 12 clusters originally identified by Kidson (2000) were needed to capture the main features of New Zealand’s circulation variability. It is important to note that many of the clusters obtained in Rampal et al. (2022a) closely resemble those of Kidson (2000). The first three EOFs were chosen because they accounted for 85% of the total variance in z1000. Sensitivity testing to a larger number of EOFs produced very similar clusters and the evaluation of CCAM precipitation composites remained very similar (not shown), highlighting that the results are largely insensitive to these methodological choices. For further discussion on the clustering methodology and comparisons with Kidson (2000), the reader is referred to Rampal et al. (2022b). The circulation regimes defined from ERA5 data were applied to daily GCM z1000 fields to assign a daily circulation regime. From this we assessed biases in the GCM circulation regime frequency and the association (i.e. composites) with RCM simulated precipitation.
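The final step, assigning each model day to a reanalysis-defined regime and comparing regime frequencies, can be sketched as follows. This sketch assumes that assignment is done by projecting each daily anomaly field onto the reanalysis EOFs and choosing the nearest cluster centroid in that space; the function names, the two-EOF and three-regime setup and all numerical values are hypothetical and chosen only to keep the example small.

```python
from collections import Counter

def project(field_anomaly, eofs):
    """Project a flattened anomaly field onto each EOF pattern (dot product)."""
    return [sum(a * e for a, e in zip(field_anomaly, eof)) for eof in eofs]

def assign_regime(pcs, centroids):
    """Index of the nearest centroid in EOF space (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(pcs, centroid)) for centroid in centroids]
    return dists.index(min(dists))

def regime_frequencies(labels, n_regimes):
    """Frequency of occurrence of each regime, as a percentage of days."""
    counts = Counter(labels)
    return [100.0 * counts.get(k, 0) / len(labels) for k in range(n_regimes)]

def relative_frequency_difference(model_freq, ref_freq):
    """Relative difference (%) of model regime frequency against the reference."""
    return [100.0 * (m - r) / r for m, r in zip(model_freq, ref_freq)]

# Toy setup: two EOFs over three grid points and three regime centroids.
eofs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
centroids = [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
model_days = [
    [0.9, 0.1, 0.3],
    [-1.2, 0.2, 0.0],
    [0.1, 0.8, 5.0],
    [0.7, -0.2, 1.0],
]

labels = [assign_regime(project(day, eofs), centroids) for day in model_days]
model_freq = regime_frequencies(labels, n_regimes=3)
reference_freq = [25.0, 40.0, 35.0]  # hypothetical reanalysis frequencies (%)
freq_bias = relative_frequency_difference(model_freq, reference_freq)
```

Keeping the EOFs and centroids fixed to the reanalysis means that differences in regime frequency reflect the model circulation rather than a different set of regimes.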

2.7 Cyclone tracking

Tropical cyclones and associated ex-tropical transitions were evaluated in each GCM and RCM (CCAM) pair. This allows for an assessment of whether CCAM improves the representation of these storms relative to the host GCM. Cyclone tracking was performed through the TempestExtremes v2.1 tracking algorithm (Ullrich et al. 2021). Following Gibson et al. (2023), 6-hourly mean sea level pressure (MSLP) was tracked by identifying a local minimum in the MSLP field. A single closed contour criterion is specified, based on MSLP increasing by 200 Pa over a 6.0-degree great circle distance outward from the candidate minimum, which ensures the low-pressure region is sufficiently strong and compact to be considered a coherent feature. Candidates are eliminated if another stronger MSLP minimum exists within a 6.0-degree great circle distance. As these candidate features are tracked, additional criteria are imposed: a track must persist for at least 60 h to be considered, and the maximum distance travelled between subsequent detections is 6.0 degrees. We focus our evaluation on subsets of cyclones based on genesis location latitude (between 0–25°S) and in terms of strength defined by MSLP along the track. Since tracking parameter choices can influence the overall cyclone frequency (Ullrich et al. 2021), we also carried out sensitivity testing of tracking parameters on CCAM output including the use of a warm core criterion.
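The track-level criteria above can be illustrated with a short standalone sketch. This is not TempestExtremes code: the function names and the track representation are hypothetical, and counting duration as the time between the first and last detection is an assumption of the sketch. The thresholds (60 h, 6.0 degrees, genesis between 0 and 25°S) are those quoted in the text.

```python
import math

def great_circle_deg(lat1, lon1, lat2, lon2):
    """Great-circle distance in degrees of arc (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def keep_track(track, step_hours=6, min_hours=60, max_step_deg=6.0):
    """track: list of (lat, lon, mslp_pa) at consecutive 6-hourly detections."""
    duration = (len(track) - 1) * step_hours  # assumed: first-to-last span
    if duration < min_hours:
        return False
    return all(
        great_circle_deg(a[0], a[1], b[0], b[1]) <= max_step_deg
        for a, b in zip(track, track[1:])
    )

def tropical_genesis(track, lat_south=-25.0, lat_north=0.0):
    """True if the first detection lies between 0 and 25 degrees south."""
    return lat_south <= track[0][0] <= lat_north

def minimum_mslp(track):
    """Strength measure: lowest MSLP (Pa) along the track."""
    return min(point[2] for point in track)

# Hypothetical track: 11 detections (60 h) drifting south-east while deepening.
example_track = [(-15.0 - i, 170.0 + 0.5 * i, 99000.0 - 100.0 * i) for i in range(11)]
```

A track passing `keep_track` and `tropical_genesis` would then be binned by `minimum_mslp` for the strength-based subsets described above.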

The climatology of tropical cyclone count, tracks, and maximum intensity were assessed relative to IBTrACS Version 4 (Knapp et al. 2018) and SPEArTC (Diamond et al. 2012) over the southwest Pacific. The use of two reference products enabled an assessment of observational uncertainty in the context of model evaluation. SPEArTC can be considered an update and reanalysis of IBTrACS specific to this basin (Magee et al. 2016); it includes ex-tropical cyclones in the database which are important for rainfall extremes over New Zealand (e.g. Harrington et al. 2023).

3 Results and discussion

3.1 Climatological added value

We begin by investigating added value from CCAM in the context of downscaling ERA5 reanalysis and a single GCM, ACCESS-CM2 (Fig. 3 and Supplementary Material Figure S4). Later we extend this added value analysis to include all downscaled GCMs (Table 1) including for different variables, seasons and error metrics. With the assumption that the large-scale fields are well represented in ERA5, comparing downscaled biases between ERA5 and the GCMs is useful for shedding light on different sources of model bias (e.g. large-scale circulation induced versus RCM physics).

Fig. 3

Assessment of added value for annual mean precipitation climatology (years 1982–2014, units: mm/year) from downscaling ERA5 with CCAM and from downscaling ACCESS-CM2 with CCAM. Top row shows the annual climatology for each product. Middle row shows the biases including various bias statistics. Bottom row shows the added value of CCAM: shaded values show the absolute value of the reduction in biases (positive values indicate reduced biases), with bias statistics showing the percent improvements (positive values) after downscaling. Land % is a measure of the spatial coverage of the added value (see Methods section for details). This figure is reproduced in Supplementary Material Figure S4 showing regions of negative added value

3.1.1 Precipitation

For downscaling precipitation, the top row of Fig. 3 compares the annual precipitation climatology relative to VCSN (reference) for: ERA5, ERA5 downscaled by CCAM, ACCESS-CM2, and ACCESS-CM2 downscaled by CCAM. Prior to downscaling, both ERA5 and ACCESS-CM2 substantially underestimate precipitation in high elevation regions. This underestimation is most apparent across the Southern Alps of the South Island, where the highest average annual precipitation totals are in the 3–5 m range for ERA5 and ACCESS-CM2, in contrast to approaching 12 m in VCSN. This underestimation is also apparent in the North Island in both ERA5 and ACCESS-CM2 with a lack of orographic enhancement in the precipitation climatology. After downscaling with CCAM, this high-elevation dry bias issue is considerably reduced, where the annual precipitation climatology is in much better agreement with VCSN.

The added value is more formally quantified by comparing the magnitude of climatological biases in the middle and bottom rows of Fig. 3 for various error metrics. For example, the annual climatological precipitation RMSE is reduced by around 30% when downscaling ERA5 and by around 45% when downscaling ACCESS-CM2. The improved representation of high elevation precipitation also results in substantially improved spatial pattern correlations, which increase from 0.52 to 0.92 after downscaling ACCESS-CM2 with CCAM. As noted earlier, comparing error metrics and the spatial patterns of biases between reanalysis and GCM-downscaled fields can shed light on different sources of model bias. As expected, when aggregated across the country, errors are slightly larger when downscaling ACCESS-CM2 compared to downscaling ERA5 (Fig. 3, middle row). This indicates that circulation biases induced by the GCM have partly contributed to the final downscaled precipitation biases. Interestingly, a dry bias across parts of the Southern Alps when downscaling ERA5 with CCAM switches sign to a wet bias when downscaling ACCESS-CM2. In Sect. 3.2 we investigate this in greater detail from a circulation regimes perspective. On the other hand, certain regional biases are evident across both downscaled products, indicating important biases induced by CCAM itself. A notable example of this is in the lee of the Southern Alps (i.e. across the eastern South Island) where CCAM consistently displays a wet bias relative to VCSN.

The spatial maps in the bottom row of Fig. 3 indicate that the reduction in biases from downscaling is not uniform across the country (i.e. positive values indicate a reduction in biases, while grey values indicate no change or enhanced biases). Regions of negative added value (i.e. enhanced biases) are further highlighted in Supplementary Material Figure S4. As expected, high elevation regions are where downscaling with CCAM shows the largest reduction in biases (Fig. 3). The most likely explanation for this is that low resolution models (i.e. around 100 km resolution) have insufficient representation of orography and so lack orographic enhancement (e.g. Renwick et al. 1998). It is notable that ERA5 (prior to downscaling, ~ 30 km resolution) also has a large dry bias over the Southern Alps, also reported by Pirooz et al. (2021). Rain gauge data are not directly assimilated into ERA5 (Lavers et al. 2022), therefore the biases here cannot simply be attributed to the low station density in this region. This suggests that the driving atmospheric model’s low resolution (i.e. relative to the terrain height and complexity) and physics parameterizations in ERA5 have contributed to the dry bias.

While a considerable reduction in biases is evident in many regions (particularly high elevation regions), added value from downscaling via CCAM is not apparent across the entire country for precipitation. When downscaling ACCESS-CM2, the Land% metric (10.22) indicates that just over 60% of the country shows improvements after downscaling. As noted above, the lee of the Southern Alps is a large region where CCAM appears to struggle to reduce biases from the GCM, suffering from a regionally consistent wet bias. The general spatial pattern of biases, especially the wet bias in the lee of the Southern Alps, is evident when downscaling other GCMs (e.g. EC-Earth3, Supplementary Material Figure S5). Interestingly, this same regional wet bias is shown across high resolution convective-scale forecasts and regional reanalysis products (~ 1.5-km resolution) in Pirooz et al. (2023), indicating that this issue is not unique to CCAM and not necessarily resolved at higher resolution.

3.1.2 Tasmax

The added value for daily maximum air temperature (tasmax) is shown in Fig. 4 and Supplementary Material Figure S6. As was the case for precipitation, there is a clear lack of spatial variability in the temperature climatology in high elevation regions prior to downscaling. This is particularly notable in ACCESS-CM2 with a complete lack of spatial variation in temperature by elevation. Due to the magnitude of these biases, it is unsurprising that CCAM shows relatively large added value. For example, after downscaling, RMSE and MAE are reduced by over 30%, and the spatial pattern correlation improves from 0.69 to 0.93. In the case of tasmax, the added value is more uniformly spread across the country than for precipitation (Fig. 4, bottom row).

Fig. 4

Assessment of added value for annual mean tasmax (daily maximum temperature) climatology (years 1982–2014, units: °C) from downscaling ERA5 with CCAM and from downscaling ACCESS-CM2 with CCAM. Top row shows the annual climatology for each product. Middle row shows the biases including various bias statistics. Bottom row shows the added value of CCAM: shaded values show the absolute value of the reduction in biases (positive values indicate reduced biases), with bias statistics showing the percent improvements (positive values) after downscaling. Land % is a measure of the spatial coverage of the added value. This figure is reproduced in Supplementary Material Figure S6 showing regions of negative added value

It is noteworthy that when downscaling ERA5 with CCAM, similarly large reductions in tasmax biases are evident. In particular, RMSE and MAE have reductions in the 30–40% range and the added value is spread across over 70% of the country. This is an impressive result, since ERA5 directly assimilates near-surface air temperature observations and is therefore more difficult to improve upon through downscaling. For example, across Australia, Di Virgilio et al. (2020) reported a general lack of added value for temperature fields from RCMs driven by reanalysis.

When comparing ERA5 downscaled by CCAM with ACCESS-CM2 downscaled by CCAM (Fig. 4 middle row) the consistency in the spatial pattern and magnitude of the biases suggests that large-scale circulation biases inherited from the GCM are not the primary source of CCAM tasmax biases. All other downscaled GCMs show a very similar spatial pattern in the bias for tasmax (not shown), despite some differences in the overall magnitudes for certain models. One notable difference is for EC-Earth3 which is approximately 1 °C warmer in the downscaled climatology averaged across New Zealand (Supplementary Material Figure S7). Relative to VCSN, CCAM appears to induce a warm bias across parts of the eastern North Island and the top of the South Island. These warm biases are similar in location and magnitude to those described in Gibson et al. (2023) when CCAM was driven by observed SST (no atmospheric nudging), which again suggests that these biases are driven by CCAM at the local scale. These results are consistent with other studies showing that RCMs can generate their own internal biases in near-surface air temperature independent of the driving GCM biases (e.g. Di Virgilio et al. 2020). As discussed in Gibson et al. (2023) through sensitivity testing, the spatial pattern of this apparent bias may be related to roughness length values in CCAM for tall forest canopy regions. Inconsistencies may also arise given the assumption that observational measurements of temperature are made in clearings in these regions. The apparent cold bias in CCAM across the central South Island was also found in Gibson et al. (2023). As described there, VCSN tends to have an apparent warm bias in this high-elevation region which may partly contribute to the apparent discrepancy with CCAM.

3.1.3 Tasmin

In comparison to precipitation and tasmax, tasmin generally displays the largest added value from downscaling (Fig. 5 and Supplementary Material Figure S8). For the case of downscaling ACCESS-CM2, RMSE and MAE decrease by more than 80% (Fig. 5, middle row) and added value is spread across almost the entire country (Fig. 5, bottom row). For the case of downscaling ERA5, error reductions are somewhat smaller but consistent in direction and are also spatially widespread. As was the case for tasmax, the similarity of tasmin biases when downscaling ERA5 and ACCESS-CM2 again suggests that large-scale circulation biases from the GCM have made only a relatively small contribution to these biases. The small magnitudes of the tasmin biases in both cases are generally impressive, with climatological MAE aggregated across the country of ~0.7 °C.

Fig. 5

Assessment of added value for annual mean tasmin (daily minimum temperature) climatology (years 1982–2014, units: °C) from downscaling ERA5 with CCAM and from downscaling ACCESS-CM2 with CCAM. Top row shows the annual climatology for each product. Middle row shows the biases including various bias statistics. Bottom row shows the added value of CCAM: shaded values show the absolute value of the reduction in biases (positive values indicate reduced biases), with bias statistics showing the percent improvements (positive values) after downscaling. This figure is reproduced in Supplementary Material Figure S8 showing regions of negative added value

3.1.4 Summary across models

A summary of added value is presented in Fig. 6, comparing across variables, GCMs, error metrics and seasons. The heatmap displays percentage reductions in biases after downscaling, whereby red positive values indicate a reduction in bias (i.e. added value from downscaling), while blue negative values indicate an enhancement in bias.

Fig. 6

Summary of added value for precipitation (panel a), tasmax (panel b) and tasmin climatology (panel c) across multiple host models, metrics and seasons. ANN = annual, DJF = summer, JJA = winter. Positive % values indicate added value after downscaling. Models with asterisks indicate where CCAM is driven only by the bias-corrected SST/SIC fields from the host GCM. Statistics are computed across all New Zealand grid cells over land

As noted earlier for ACCESS-CM2, the overall added value is typically largest for tasmin compared to other variables. As shown in Fig. 6c, the large reductions in tasmin biases, particularly MAE and RMSE, that were shown earlier for ACCESS-CM2 are consistent across models and seasons. The only exception to this is when downscaling ERA5, which as discussed earlier is expected to be challenging for temperature (compared to precipitation) since ERA5 directly assimilates surface temperature data. It is also encouraging that for tasmax, in almost all cases, there is consistent added value across models, seasons, and error metrics. The only exception to this is for EC-Earth3 in summer where tasmax biases increase slightly for certain metrics (i.e. MAE and Land%).

For precipitation, while there is generally added value shown across models (Fig. 6a), a notable exception is in winter for the models where CCAM is driven by SST/SIC fields (i.e. AWI-CM-1-1-MR, GFDL-ESM4, CNRM-CM6-1). For these models, CCAM generally enhances biases in winter precipitation compared to the host model (although the spatial pattern correlation is still consistently improved after downscaling). Individual inspection of these model precipitation biases (e.g. GFDL-ESM4, Supplementary material Figure S9) reveals a consistent pattern of bias across SST/SIC driven models characterised by a wet bias in winter precipitation across the Southern Alps. This bias, and the associated reduction in added value, is limited to winter precipitation, and does not impact other seasons (i.e. MAM, SON, not shown).

Comparisons to the CCAM biases in Gibson et al. (2023), whereby CCAM was forced by observed SST/SIC (OSTIA), provide useful context for explaining these winter precipitation biases. The biases from that study provide an upper bound for how CCAM is expected to perform in this study when driven by bias-corrected SST/SIC fields from GCMs. As discussed in Gibson et al. (2023), when driven by observed SST/SIC, CCAM has a tendency for too strong winter ridging over Southern Australia and the Tasman Sea and an associated too strong winter jet and storm track activity, particularly affecting the south of the South Island (Supplementary Material Figure S10), consistent across initial condition ensemble members. These same circulation biases are shown to be very similar across the SST/SIC driven CCAM runs here, with the magnitude increased slightly here (Supplementary Material Figure S11). In Gibson et al. (2023) evidence was presented to suggest that these circulation biases over the New Zealand region may be related to Rossby wave train biases induced by remote deep convection biases in certain regions of the tropics in CCAM. Given the similarities, it seems plausible that this is also partly responsible for the circulation biases here which subsequently drive the wet bias in high elevation precipitation across the Southern Alps. Since this approach to dynamical downscaling (i.e. CCAM driven by SST/SIC) is also commonly performed by other studies (e.g. Chapman et al. 2023; Grose et al. 2023), we suggest that these findings carry wider relevance for downscaling applications in other regions and settings. As such, similar process-based investigation into large-scale circulation biases and their causes in CCAM is warranted. Testing the sensitivity of these biases to the grid configuration and physics schemes in CCAM would be particularly useful.

3.2 Circulation regimes

The above model evaluation considered precipitation only in a climatological sense. Here we present a more comprehensive evaluation of CCAM precipitation through the perspective of circulation regimes. To remove any precipitation biases induced by large scale circulation biases, here we focus on the ERA5 reanalysis-driven CCAM simulation.

Precipitation composites are shown by circulation regime in Fig. 7 comparing VCSN (top panels) and CCAM (bottom panels). Note that the circulation regime composites themselves are almost identical, as both are driven by ERA5. Very minor differences can however arise due to how CCAM modifies the circulation fields from ERA5 through spectral nudging. Instead, we focus on how the precipitation composites compare between VCSN and CCAM across circulation regimes, quantified through the spatial pattern correlation assigned to each regime. As expected, the presence of widespread cyclonic conditions (namely the L and LSW regimes) typically results in large widespread precipitation totals over much of the country, which is reproduced well by CCAM (pattern correlation of 0.86–0.92). Strong zonal westerly flow conditions (most apparent under the HW circulation regime) result in a large west/east gradient of precipitation associated with orographic enhancement over the Southern Alps and dry conditions to the lee. This is also well captured by CCAM (pattern correlation of 0.93).

Fig. 7

Circulation regimes (shaded contours over ocean, z1000) and associated composite precipitation (shaded land values) for Reference (ERA5/VCSN, upper panels) and ERA5 (CCAM) (lower panels). Pattern correlations for the spatial agreement of precipitation (CCAM versus VCSN) are shown for each circulation regime

In terms of weaknesses of CCAM, as noted earlier, CCAM tends to produce too much climatological precipitation in regions to the lee of the Alps. From the perspective of circulation regimes, we can see that this regional wet bias manifests mostly through synoptic northerly and north-easterly flow conditions (i.e. the HSE and LNW regimes) as opposed to westerly flow conditions. Nevertheless, the general finding of highly consistent agreement in these precipitation composites further suggests that CCAM can represent precipitation in a spatially and physically correct manner under typical synoptic circulation regimes. Despite not being commonplace in RCM evaluation studies, we suggest that this type of circulation regime-based evaluation is a valuable addition to model evaluation, providing additional insights into model biases beyond simple climatological statistics.

The frequency of circulation regimes in the GCMs can also be useful for further understanding aspects of the downscaled climatological precipitation biases presented earlier. For example, when comparing CCAM downscaled precipitation from ACCESS-CM2 against that from ERA5, the ACCESS-CM2-driven simulation had a significant increase in precipitation across the Southern Alps. One possible explanation for this is that zonal circulation regimes associated with westerly or north-westerly flow conditions (namely LSW, HW, HSE), shown to produce large precipitation totals over the Alps (Fig. 7), occur too frequently in ACCESS-CM2. There is some evidence for this shown in Fig. 8, where the LSW and HSE regimes occur around 5% and 10% more frequently, respectively, than in ERA5. Another apparent cause for the wet precipitation bias in ACCESS-CM2 is that there is also too much precipitation occurring under these key circulation regimes, especially for the north to north-westerly flow conditions under the HSE regime (Supplementary material Figure S12). The circulation regime frequency analysis (Fig. 8) also reveals that both the SST/SIC driven CCAM runs and the nudged CCAM runs have overall roughly similar biases in regime frequency on an annual basis. This is encouraging given the tendency for an overly strong winter jet and storm track bias in the SST/SIC driven CCAM runs, as discussed earlier. While the focus here has been on historical evaluation, it would be interesting in future work to assess future projections of these circulation regimes in CMIP6, building upon prior regional attribution and CMIP5 studies (e.g. Gibson et al. 2016; Thomas et al. 2023), including from this subset of downscaled projections.

Fig. 8

Annual cluster regime frequency in each model. Panel a refers to the frequency of occurrence; panel b refers to the frequency (as a relative percentage) against ERA5. Cluster regimes relate to those shown in Fig. 7

3.3 Inter-annual variability

There is also interest in whether the downscaled output from CCAM appropriately captures inter-annual variability in temperature over land. Although New Zealand is an island country, where temperature variability over land is strongly related to SST variability, land surface feedbacks may also be important in certain inland regions, particularly for temperature extremes in late summer (Harrington 2021). Since SSTs are prescribed to CCAM in these simulations, changes in inter-annual variability in land temperature through downscaling (i.e. relative to the host GCM) are likely indicative of the role of the land surface model. Furthermore, for the SST/SIC-driven CCAM simulations, since bias correction is first performed on the SSTs, changes in inter-annual variability in temperature after downscaling are useful for assessing how the bias correction has influenced temperature variability over land.

In the nudged simulations for tasmax and tasmin in summer (Fig. 9), there is relatively little change in variability after downscaling. Depending on the host GCM, downscaling can either slightly increase or decrease inter-annual variability, but these changes are small relative to the overall differences between models. As expected, this implies that differences in temperature variability between the host GCMs have a strong first order control on downscaled temperature variability over land. For the SST/SIC driven simulations (marked by asterisk) downscaling has a much larger influence on inter-annual variability, which is very likely due to the bias and variance correction of SSTs performed prior to downscaling (see Sect. 2.4 for details). Notably, this change in variability is consistently in the direction of improving variability relative to VCSN. Overall, it is encouraging that for both nudged and SST/SIC driven simulations CCAM appears to produce similar temperature variability to VCSN. Similar results were found for other seasons (not shown).
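For illustration, a summer inter-annual variability measure of the kind compared here could be computed as below. The text does not state the exact statistic, so this sketch assumes the sample standard deviation of DJF means across years, with each season labelled by the year of its January and the three months weighted equally; the function names and temperature values are hypothetical.

```python
import statistics

def djf_means(monthly):
    """monthly maps (year, month) to an area-mean temperature in deg C.

    Returns {season_year: DJF mean}, where the season is labelled by the year
    of its January and incomplete seasons are skipped.
    """
    seasonal = {}
    for year in sorted({y for (y, _m) in monthly}):
        keys = [(year - 1, 12), (year, 1), (year, 2)]
        if all(k in monthly for k in keys):
            seasonal[year] = sum(monthly[k] for k in keys) / 3.0
    return seasonal

def interannual_std(seasonal):
    """Sample standard deviation of the seasonal means across years."""
    return statistics.stdev(list(seasonal.values()))

# Hypothetical New Zealand-averaged tasmax values for three complete summers.
monthly = {
    (2000, 12): 16.0, (2001, 1): 18.0, (2001, 2): 17.0,
    (2001, 12): 17.0, (2002, 1): 19.0, (2002, 2): 18.0,
    (2002, 12): 15.0, (2003, 1): 17.0, (2003, 2): 16.0,
}
seasonal = djf_means(monthly)
variability = interannual_std(seasonal)
```

Applying the same function to the host GCM, the downscaled output and the reference product gives directly comparable values, since all three are reduced to one number per summer before the spread is taken.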

Fig. 9

Summer (DJF) interannual variability (years 1972–2014, units: °C) of New Zealand-averaged (land grid cells) tasmax and tasmin before downscaling (grey bars, GCM) and after downscaling (blue bars, CCAM). Dotted horizontal line shows the VCSN (reference). Models with asterisks indicate where CCAM is driven only by the bias-corrected SST/SIC fields from the host GCM.

3.4 Extreme events

Next, we describe added value in the context of extreme events through analysis of ETCCDI indices. As before, we begin with an in-depth evaluation of added value for reanalysis (ERA5) alongside a single GCM (ACCESS-CM2) for select indices before summarizing across all GCMs and indices.

3.4.1 Rainfall extremes

For the climatology of the annual wettest 3 consecutive days (Rx3day) (Fig. 10 and Supplementary Material Figure S13) the spatial patterns of biases are broadly consistent and scale with the biases described earlier for mean precipitation. For example, for the reanalysis-driven simulation, there is a tendency for a dry bias in extreme rainfall in high elevation regions, and a wet bias in the lee of the Alps. Regarding the dry bias, the maximum Rx3day in CCAM is 478 mm and 572 mm when driven by ERA5 and ACCESS-CM2, respectively, while the maximum Rx3day in VCSN is 720 mm. However, despite these biases, this still constitutes a large improvement in Rx3day relative to ERA5, in which the maximum Rx3day is less than 300 mm (i.e. dry biased) and the mean is also considerably dry biased. The apparent wet bias for CCAM rainfall extremes in the lee of the Alps might be somewhat inflated due to issues with VCSN underestimating rainfall extremes in this region in certain historical events (Stone et al. 2022). One region of relatively large improvements for CCAM is across the eastern North Island where Rx3day values are a close match to VCSN but are substantially underestimated in ERA5 and ACCESS-CM2 by as much as 50%. As shown by Cyclone Gabrielle recently (Harrington et al. 2023), this region can receive significant multiday rainfall totals from ex-tropical cyclones with devastating consequences, making CCAM’s improvement over the global products an important example of added value. Similar results (i.e. spatial patterns of biases and added value) are found for single day extreme rainfall analysis from Rx1day (Supplementary Material Figure S14).
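As a minimal sketch of how an Rx3day climatology is formed at a single grid cell, the annual maximum of the 3-day running total can be computed for each year and then averaged across years. The function names and the short daily series are hypothetical, and a real calculation would use full years of daily data.

```python
def rx_nday(daily_precip, n=3):
    """Largest n-day running precipitation total (mm) within one year."""
    if len(daily_precip) < n:
        raise ValueError("series is shorter than the accumulation window")
    return max(sum(daily_precip[i:i + n]) for i in range(len(daily_precip) - n + 1))

def climatological_rx_nday(precip_by_year, n=3):
    """Mean of the annual maxima across years."""
    annual = [rx_nday(series, n) for series in precip_by_year.values()]
    return sum(annual) / len(annual)

# Hypothetical daily precipitation (mm) for two very short "years".
precip_by_year = {
    1982: [0.0, 10.0, 40.0, 5.0, 0.0, 60.0, 0.0],
    1983: [20.0, 20.0, 20.0, 0.0, 0.0, 1.0, 1.0],
}
rx3day_1982 = rx_nday(precip_by_year[1982], n=3)
rx3day_clim = climatological_rx_nday(precip_by_year, n=3)
```

Setting `n=1` gives the corresponding Rx1day quantity with the same code.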

Fig. 10

Assessment of added value for annual Rx3day (wettest 3 consecutive days) precipitation climatology (years 1982–2014, units: mm) from downscaling ERA5 with CCAM and from downscaling ACCESS-CM2 with CCAM. Top row shows the annual climatology for each product. Middle row shows the biases including various bias statistics. Bottom row shows the added value of CCAM: shaded values show the absolute value of the reduction in biases (positive values indicate reduced biases), with bias statistics showing the percent improvements (positive values) after downscaling. Land % is a measure of the spatial coverage of the added value (see Methods section for details). This figure is reproduced in Supplementary Material Figure S13 showing regions of negative added value

3.4.2 Temperature extremes

For the climatology of the annual hottest day (TXX) (Fig. 11 and Supplementary Material Figure S15) the spatial patterns of biases from CCAM are again generally consistent with those of the daily tasmax climatology shown earlier. Notably, there are regions where CCAM tends to overestimate the hottest day of the year by more than 2°C, namely across the top of the South Island and across eastern North Island. Conversely, CCAM tends to underestimate the hottest day of the year over inland regions of the South Island, though as discussed earlier there is also considerable observational uncertainty across this region. Since similar spatial patterns of biases are evident across both ERA5 and other GCMs (not shown) this suggests that these biases in TXX are again driven predominantly by CCAM at the local scale and less so by specific biases in the host GCM. Despite these biases, for downscaling both ERA5 and ACCESS-CM2 there is generally large added value with reductions in RMSE and MAE in the range of 25–44% when aggregated across New Zealand. As indicated by the Land% metric, the spatial consistency of added value for temperature-based extremes is generally larger (i.e. more widespread) compared to that for precipitation-based extremes shown earlier (Fig. 10).

Fig. 11

Assessment of added value for annual TXX (hottest day) climatology (years 1982–2014, units: °C) from downscaling ERA5 with CCAM and from downscaling ACCESS-CM2 with CCAM. Top row shows the annual climatology for each product. Middle row shows the biases including various bias statistics. Bottom row shows the added value of CCAM: shaded values show the absolute value of the reduction in biases (positive values indicate reduced biases), with bias statistics showing the percent improvements (positive values) after downscaling. Land % is a measure of the spatial coverage of the added value (see Methods section for details). This figure is reproduced in Supplementary Material Figure S15 showing regions of negative added value

3.4.3 Summary of ETCCDI

A summary of added value across GCMs and ETCCDI indices is presented in Fig. 12 based on climatological RMSE comparisons. From this, widespread evidence of added value is apparent from downscaling. Generally, the largest added value is for temperature-based indices, especially daily temperature range (DTR) and frost days (FD). For both of these indices, the host GCMs have a strong tendency to underestimate magnitudes (i.e. overall too few frost days and insufficient diurnal temperature range, not shown) which are greatly improved upon in the downscaling by CCAM across GCMs. Despite certain biases described above for precipitation extremes, added value from downscaling is also consistent across GCMs for the various indices analysed (i.e. Rx1day, Rx3day, Rx5day, R95p).

Fig. 12

Summary of added value for ETCCDI temperature and precipitation extremes (positive % values indicate added value after downscaling) across multiple models. Models with asterisks indicate where CCAM is driven only by the bias-corrected SST/SIC fields from the host GCM. Statistics are computed across all New Zealand grid cells over land

Generally, the smallest or degraded added value is for consecutive wet days (CWD) and consecutive dry days (CDD). For these indices, there are single model examples (i.e. for NorESM2-MM and AWI-CM-1-1-MR GCMs) where the downscaling from CCAM has not led to overall improvements. Although not shown, CCAM tends to slightly overestimate CWD in most locations (i.e. wet spells are too long) and underestimate CDD (i.e. dry spells are too short). When driven by reanalysis and aggregated across the country, CCAM tends to also overestimate the number of wet days (R1mm) and slightly underestimate the average rainfall on wet days (SDII). This implies that the biases in CWD and CDD shown here are driven by too frequent low-intensity rainfall events in CCAM. A similar finding was reported in Gibson et al. (2023), where CCAM was not nudged to atmospheric conditions from GCMs, suggesting that the biases in these indices are generated by CCAM at the local scale. It is also noteworthy that the SST/SIC-driven models in Fig. 12 show an overall similar degree of added value as the nudged runs, highlighting the combined usefulness of both approaches in the ensemble.

Assessments of RCM added value from other studies and regions have generally reported mixed results, with examples of both improvements and deteriorations after downscaling (Di Luca et al. 2016 and references therein). As discussed in Di Luca et al. (2016) for Australia, the largest added value was generally found in regions of complex topography and land-sea contrasts, where GCM biases can be very large. They also generally reported the largest added values for extreme statistical quantities compared to mean fields. Our results are broadly consistent, whereby we report the largest added value for precipitation over high elevation regions of complex terrain, where precipitation totals (for both the mean and extreme statistics) are consistently and considerably underestimated across all GCMs. Similarly for temperature, considerable improvements are found in high elevation regions where the representation of more complex fine-scale patterns becomes apparent. In contrast, the temperature climatology fields in GCMs over New Zealand tend to be too spatially homogeneous, often varying only by latitude (e.g. ACCESS-CM2, Fig. 4).

Using a larger RCM ensemble from CORDEX, Di Virgilio et al. (2020) found that temperature (maximum and minimum) for both mean and extreme statistics over Australia was generally degraded in the RCM output when downscaling reanalysis. For downscaling GCMs, they also reported instances where downscaling did not add value for temperature fields over the historical period. It is therefore encouraging that in this study for New Zealand we find generally greater consistency in added value for temperature extremes (Fig. 12). A likely reason for this difference is that, compared to Australia, a greater proportion of New Zealand land area is dominated by complex terrain and land-sea contrasts, allowing ample opportunities for the RCM to ameliorate substantial GCM biases. Consistent with this, Di Virgilio et al. (2020) also found that added value was generally much greater for the mean and extremes of temperature fields across isolated stretches of Australian coastline and across the Australian Alps.

3.5 Tropical cyclones

Tropical cyclone climatology was analysed on a per model basis, before and after downscaling from CCAM with tracking from TempestExtremes (Ullrich et al. 2021). Consideration of cyclone intensity was based on the relatively simple metric of minimum cyclone central pressure along the track (Roberts et al. 2020; Gibson et al. 2023); this reduces data requirements and helps improve consistency across tracking methodologies and best track databases. Consideration of cyclone strength is important in the context of downscaling, given issues in coarse resolution GCMs (and reanalysis) where strong tropical cyclones generally occur too infrequently (Roberts et al. 2020).

We begin by showing an example (Fig. 13) of the tropical cyclone climatology from NorESM2-MM focusing on relatively strong cyclones (reaching at least category 3, MSLP < 965-hPa). As shown in Fig. 13a, compared to the reference SPEArTC, NorESM2-MM substantially underestimates the frequency of cyclones that reach this strength (NorESM2-MM = 35 versus SPEArTC = 119). While observational uncertainties exist for tropical cyclone counts (Schreck et al. 2014), this uncertainty is likely small relative to the magnitude of this bias. For example, IBTrACS was found to produce a very similar overall climatology (n = 123, not shown). After downscaling NorESM2-MM with CCAM (Fig. 13b) the underrepresentation is reduced considerably, with CCAM producing n = 100 events over this time period. The genesis locations and tracks also appear to be generally well captured by CCAM in a climatological sense. One apparent difference is that the tracks from CCAM appear to reach into high latitudes more readily, especially to the east of the international dateline. A similar finding was reported in Gibson et al. (2023) when CCAM was driven by observed SSTs, indicating that this apparent difference is not driven by biases in the GCM large-scale circulation fields or biases in SST. It is difficult to determine whether this is a true bias in CCAM or whether it stems more from underlying differences in how storms are recorded in these observational track databases (across both IBTrACS and SPEArTC). For example, as discussed in Schreck et al. (2014), counts can be more uncertain as storms traverse into higher latitudes due to differences in counting procedures between reporting agencies and changes to these procedures across time which are unaccounted for. Future research evaluating the background atmospheric environmental conditions generated by CCAM relevant to the maintenance of cyclones at higher latitudes, namely vertical wind shear, would also be beneficial (Walsh and Katzfey 2000).

Fig. 13

Genesis locations (yellow circles) and tracks of tropical cyclones for (a) NorESM2-MM (raw), (b) NorESM2-MM (CCAM), (c) SPEArTC as reference. Tracks are shown with genesis location constrained to 0–25°S and where the cyclone MSLP reaches at least Category 3 intensity (below 965-hPa along the track) for November through April 1982–2014. The number of events is shown in the top left corner for each product
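The selection criteria described for Fig. 13 can be sketched as a simple filter over tracked cyclones. This is only an illustration under stated assumptions: the track record layout (`genesis_lat`, `genesis_month`, `mslp_hpa`) is hypothetical and is not the TempestExtremes output format, the season is assumed to be judged by the genesis month, and the example tracks are invented.

```python
SEASON_MONTHS = {11, 12, 1, 2, 3, 4}  # November through April

def is_selected(track, max_mslp_hpa=965.0):
    """True if a track meets the genesis, season and intensity criteria."""
    in_band = -25.0 <= track["genesis_lat"] <= 0.0        # genesis between 0 and 25 deg S
    in_season = track["genesis_month"] in SEASON_MONTHS   # assumption: season set by genesis month
    strong = min(track["mslp_hpa"]) < max_mslp_hpa        # minimum central pressure along track
    return in_band and in_season and strong

# Hypothetical tracks: latitude in degrees north (negative = south), MSLP in hPa
tracks = [
    {"genesis_lat": -12.0, "genesis_month": 2, "mslp_hpa": [1002.0, 980.0, 958.0]},
    {"genesis_lat": -18.0, "genesis_month": 1, "mslp_hpa": [1000.0, 975.0, 970.0]},
    {"genesis_lat": -30.0, "genesis_month": 3, "mslp_hpa": [990.0, 960.0]},   # genesis too far south
    {"genesis_lat": -10.0, "genesis_month": 7, "mslp_hpa": [995.0, 950.0]},   # outside the season
]

n_cat3_plus = sum(is_selected(t) for t in tracks)                      # MSLP < 965 hPa
n_cat2_plus = sum(is_selected(t, max_mslp_hpa=980.0) for t in tracks)  # MSLP < 980 hPa
print(n_cat3_plus, n_cat2_plus)
```

Applying the same filter to the host GCM, the downscaled output and the reference track database keeps the event counts comparable across products.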

Next, we extend this analysis of added value in the representation of tropical cyclone climatology across GCMs and for different intensity categories. As shown in Fig. 14a, for storms that reach at least category 2 (MSLP < 980-hPa), downscaling considerably improves the cyclone frequency across GCMs which were otherwise consistently underrepresented. This general, and often substantial, underrepresentation in storm frequency in the coarse-resolution GCMs is consistent with other studies in other basins (e.g. Roberts et al. 2020). However, it is notable that the number of storms after downscaling with CCAM is shown here to be strongly dependent on the GCM. For example, after downscaling NorESM2-MM and EC-Earth3 have too many events while the other downscaled GCMs compare reasonably well with observational estimates at this intensity.

Fig. 14

Cyclone counts per year separated by cyclone category before downscaling (grey bars, GCM) and after downscaling (blue bars, CCAM) for November through April 1982–2014. When defining cyclones, the same region is used as in Fig. 13, i.e. with genesis constrained to 0–25°S. Horizontal lines show counts from two different reference products. Models with asterisks indicate where CCAM is driven only by the bias-corrected SST/SIC fields from the host GCM. Note AWI-CM-1-1-MR is not shown here due to data availability limitations of the host GCM output

For stronger category 3+ events (MSLP < 965-hPa, Fig. 14b), while CCAM considerably increases the frequency across all GCMs (typically by more than a factor of 2), the frequency remains generally underestimated by CCAM after downscaling. The exception is EC-Earth3, which slightly overpredicts frequency at this intensity. Similarly, for category 4+ events (Fig. 14c) the underrepresentation further increases across all models. The sensitivity to tracking parameters including warm core criteria (Ullrich et al. 2021) is shown in Supplementary Material Figure S16. When applied to downscaled CCAM output, modification to tracking parameters and warm core inclusion typically reduces tropical cyclone frequency by around 5–20% (depending on the model and intensity) and the overall variability between models is preserved. From this we conclude that sensitivity to tracking parameters is small in the context of the often-large improvements (especially in the category 2–3 range) in frequency from downscaling.

Over the historical period, only two category 5 events were simulated by CCAM, both when downscaling EC-Earth3, compared to 19 events of this magnitude in observations across both IBTrACS and SPEArTC (not shown). A snapshot of the MSLP and wind fields from CCAM during one of these extreme TC events is shown in Supplementary Material Figure S17. During this event, the central MSLP reached a track minimum of 913.5-hPa with hourly maximum 10-m surface wind speeds of 72.6 m s⁻¹. Although clearly sensitive to the driving GCM fields, it is encouraging that CCAM can produce an event of this magnitude at this resolution.

The differences in the representation of tropical cyclones between downscaled GCMs indicate that the thermodynamic and dynamic conditions responsible for cyclone genesis, intensification and maintenance differ strongly between GCM driving fields. To better understand and quantify this, a target for future research is to investigate these conditions (e.g. SSTs, SST gradients, wind shear) across GCMs before and after downscaling. Another possible cause for the underrepresentation of the most extreme events is the model spatial resolution across the tropics and subtropics north of New Zealand. While the CCAM resolution (variable by ~ 15-25-km) in this region is greatly improved compared to the host GCMs (~ 80-150-km depending on the model) it may still be a limiting factor. Roberts et al. (2020) found that models with enhanced resolution around 25-km typically tend to have more frequent and stronger tropical cyclones with generally reduced biases compared to events in CMIP6-class models. However, category 3 and 4 storms were still commonly underrepresented by GCMs at this higher resolution as well as in modern reanalysis products.

As discussed in Roberts et al. (2020) further work is needed to understand how parameterizations of unresolved processes could benefit representation of the most intense tropical cyclones in models at ~ 25-km resolution. Process-based examinations of the boundary layer, convection and surface drag schemes may lead to additional insights. There are at least a few examples of global models at ~ 15 km resolution that can simulate very strong tropical cyclones well (e.g. Chauvin et al. 2020), but the results appear rather sensitive to the coefficients of the turbulence scheme that enhance convection (Roberts et al. 2020). Future research should examine this further in CCAM, targeting the impact of the turbulence scheme and different aspects of the stretched grid configuration. There is already preliminary evidence from Gibson et al. (2023) for the importance of the grid configuration in CCAM, where the number of storms reaching category 2 decreases by ~ 50% over the South Pacific basin when the model is run with a quasi-uniform ~ 100 km resolution grid compared to the high resolution stretched grid used by CCAM in the present study. Further examination of the sensitivity of strong tropical cyclones to the placement of the high-resolution grid (i.e. including placing over the main genesis regions) and to the Schmidt grid stretching factor would be useful.

Despite these remaining biases in CCAM, it is worth highlighting that the large improvements in tropical cyclone frequency for events reaching at least category 2 (and more modest improvements for events reaching at least category 3) are likely to be valuable for future studies exploring tropical cyclone risk under climate change in this region. For example, one of New Zealand’s most devastating weather events on record was caused by Cyclone Gabrielle in February 2023 (Harrington et al. 2023), which had a lowest track MSLP of 958.1-hPa (category 3) before undergoing ex-tropical transition and impacting New Zealand with a strength of 966.6-hPa recorded over land (equivalent to category 2 strength). Given the important role that atmospheric rivers also play in driving precipitation extremes over New Zealand (e.g. Prince et al. 2021; Reid et al., 2021), future work will focus on evaluating and understanding future projections of these features in this new set of downscaled simulations.
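The pressure-based intensity categories used throughout this section can be expressed as a small lookup, sketched below. Only the two thresholds stated in the text are included (category 2 below 980 hPa and category 3 below 965 hPa); thresholds for categories 4 and 5 are not given in this section and are deliberately left out rather than guessed. The function name is hypothetical.

```python
# Thresholds stated in the text: category 2 below 980 hPa, category 3 below 965 hPa.
# Categories 4 and 5 are omitted because their thresholds are not given in this section.
THRESHOLDS_HPA = {2: 980.0, 3: 965.0}

def highest_category_reached(min_mslp_hpa, thresholds=THRESHOLDS_HPA):
    """Return the highest listed category whose pressure threshold is met, else None."""
    reached = [cat for cat, limit in thresholds.items() if min_mslp_hpa < limit]
    return max(reached) if reached else None

# Cyclone Gabrielle values quoted in the text
gabrielle_peak = highest_category_reached(958.1)       # lowest track MSLP
gabrielle_over_land = highest_category_reached(966.6)  # strength recorded over land
weak_system = highest_category_reached(995.0)          # hypothetical weaker system
print(gabrielle_peak, gabrielle_over_land, weak_system)
```

With these thresholds the two Gabrielle pressures map to category 3 and category 2 respectively, matching the classification given in the text.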

3.6 Observational uncertainty and caveats

Observational uncertainty is an important consideration in the context of climate model evaluation, especially for extreme events and in regions where observations are sparse (e.g. Gibson et al. 2019; Alexander et al. 2020). Compared to many regions of the globe, New Zealand’s observational network is relatively dense over the time period of model evaluation used here (1982–2014). A caveat is that few long-term station observations are available in high elevation regions where precipitation is often highly heterogeneous (Tait et al. 2012). As discussed earlier, the observational uncertainty in VCSN in these regions is relatively large, which imposes certain caveats on the interpretation of model biases. It would be useful to re-assess the model biases documented here for any future releases of VCSN that account for additional station data or more sophisticated approaches to spatial interpolation that directly account for the complexities of extreme events.

Despite these caveats, we suggest that VCSN currently provides the most reliable high-resolution daily gridded rainfall and temperature observations for evaluation of model climatologies over New Zealand. Vishwanathan et al. (2023) recently compared several existing gridded products over New Zealand relative to station observations, including VCSN (non-bias corrected), merged satellite-gauge products and reanalysis. The authors report that VCSN and MSWEP (a merged satellite-gauge product, Beck et al. 2019) generally compare well and provide the overall best agreement with station observations, though this varies by region and metric. However, their assessment of MSWEP is complicated by the fact that it was constrained to a limited set of station observations, which do not sample high elevation regions well. More recently, Campbell et al. (2024; submitted) found that MSWEP tends to substantially underestimate high-elevation precipitation, including for both mean and extremes. This likely stems from the fact MSWEP is weighted towards lower resolution reanalysis products that are known to underestimate high elevation precipitation across New Zealand (e.g. Pirooz et al. 2021 and Fig. 3 above) combined with considerably fewer in-situ precipitation observations being ingested than in VCSN. For this reason, and despite the important caveats of VCSN described here, we suggest that the use of MSWEP would not provide additional value.

The evaluation here was based on assessing the raw downscaled model output (i.e. no additional bias correction has been performed). This provides a useful overview of the model strengths and weaknesses and is also important since certain aspects of model biases cannot be easily corrected through bias correction (Maraun et al. 2017). However, in our experience, since bias corrected output is highly sought after and often essential for certain climate impacts applications and services, this will be performed across all simulations for a reduced set of core variables. The bias correction results and methodology will be documented and made available in a separate publication.

4 Conclusions

The overall aims of this study were to: (1) document the experiment design for producing the updated CMIP6 dynamical downscaled climate projections for New Zealand; (2) comprehensively assess biases and quantify the added value over the historical period of the downscaling across climatological fields, circulation regime composites, extreme indices, and tropical cyclones.

The experiment design involves downscaling 6 GCMs with CCAM (primarily) with a stretched grid configuration. The stretched grid centres high resolution over New Zealand (~ 12-km) accompanied by relatively high resolution (12-35-km) over the wider South Pacific. This extended high-resolution domain appears favourable for the representation of tropical cyclone frequency and may also prevent issues around the placement of the inner domain in typical limited area RCMs. While focus here was exclusively on CCAM, the wider downscaling project for New Zealand also includes two other RCMs (WRF and the UM) downscaling reanalysis and select GCMs. The selection of which GCMs to downscale was informed by balancing evaluated performance over the historical period, future rate of warming, and model independence. From this, three of the 6 GCMs were downscaled directly with CCAM through spectral nudging to atmospheric fields. The remaining three GCMs were downscaled using only their bias-corrected SST/SIC fields as lower boundary conditions to drive CCAM.

In terms of added value, the main strengths of the downscaled output are:

  • Precipitation and temperature mean climatology fields are generally greatly improved compared to the host GCM. The largest and most consistent improvements are for daily minimum temperature. While certain biases persist, the representation of orographic precipitation is greatly improved after downscaling both reanalysis and GCMs.

  • When driven by reanalysis, relative to observations, CCAM produces very similar precipitation patterns and amounts according to circulation regime type.

  • Interannual variability in temperature is generally in close agreement with observations. There is no evidence of systematic under or overestimation of variability.

  • Large consistent improvements for several temperature and precipitation-based extreme indices after downscaling relative to the host GCM.

  • Generally large improvements to the representation of tropical cyclone frequency in the category 2 and 3 intensity range, which are consistently and strongly under-represented in the host GCMs.

The main consistent weaknesses of the downscaled output are:

  • Climatological precipitation in certain low elevation regions appears overestimated by CCAM relative to VCSN, particularly in the lee of the Southern Alps.

  • Positive precipitation bias in winter across the SST/SIC-driven simulations. This is related to CCAM producing an overly active storm track which impacts the South Island.

  • Of the extreme indices evaluated, consecutive wet days (CWD) and consecutive dry days (CDD) generally have the largest relative biases. In particular, CCAM tends to slightly overestimate CWD in most locations (i.e. wet spells are too long) and underestimate CDD (i.e. dry spells are too short) driven by too frequent low intensity rainfall.

  • In certain regions, the temperature during hot extremes has a warm bias, but biases are generally much lower than the host GCMs.

  • While there are general improvements to TC frequency, the frequency of very strong category 4 and 5 events remains too low in CCAM. However, it is encouraging that CCAM can occasionally simulate a category 5 intensity event.

Through comprehensively documenting the relative strengths and weaknesses of the downscaled output in this study, we hope that users will be able to make better informed decisions about the most appropriate use of the data for a given climate application. In our ongoing bias correction work, we are targeting specific methods that best address the largest remaining model output biases documented here, while seeking to retain physically plausible climate change signals in the future projections.