1 Introduction

From the earliest days of initialized multi-year predictions with Earth system models, two of the biggest challenges have been not only the choice of initialization method, but also how to deal with the large drifts that result from systematic errors in the models (e.g. Meehl et al. 2009, 2014, 2021). Several initialization methods have been applied (Meehl et al. 2021), and bias and drift can be removed by one of several different methods (Meehl et al. 2022). For the latter, the most common is to run a large hindcast set, form a model climatology of drifted states for the lead years to be evaluated, compute anomalies relative to that climatology, perform a similar calculation for observations, and evaluate the hindcast skill of the model anomalies against the observed anomalies for the same period (e.g. Doblas-Reyes et al. 2013). Another is to compute mean drift adjustments from observations for the lead years of interest, remove those drifts from the hindcasts, and then compute anomalies relative to the 15 years of observations prior to the initial year of the prediction (e.g. Meehl et al. 2016). A third is a compromise between the first two, whereby a mean drift is calculated in the hindcasts for the 15 model years prior to the initial year and used as the reference for computing anomalies, with a similar calculation for the observations (Meehl et al. 2022). It has been noted that this third method is likely the most “fair” of the three, in that computing a model climatology over the entire reference period is not possible in a real-time prediction exercise, where only information prior to the initial year is available (Risbey et al. 2021). In an evaluation of those three methods, the “fair” method (comparing to the previous 15 year climatologies) was found to have somewhat less skill, as might be expected (Risbey et al. 2021), though all three were roughly comparable, with skill varying somewhat differently as a function of time but no method clearly superior to the others (Meehl et al. 2022).
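For concreteness, the following minimal sketch illustrates the first and third anomaly-computation approaches for a single grid point; the array shapes, function names, and window handling are illustrative assumptions, not the processing code used in the studies cited above.

```python
import numpy as np

# Sketch only: hindcast has shape (n_start, n_lead); reference is a yearly
# series (drifted model years or observations) with matching ref_years.

def anomalies_full_hindcast_clim(hindcast):
    """Method 1: subtract a lead-dependent climatology formed from all
    drifted hindcasts in the set."""
    return hindcast - hindcast.mean(axis=0, keepdims=True)

def anomalies_prev15(hindcast, start_years, reference, ref_years, window=15):
    """'Fair' method: reference each start year only to the mean of the
    15 years preceding it, mimicking real-time information limits."""
    anoms = np.full(hindcast.shape, np.nan)
    for i, y0 in enumerate(start_years):
        prior = (ref_years >= y0 - window) & (ref_years < y0)  # only prior years
        if prior.any():
            anoms[i] = hindcast[i] - reference[prior].mean()
    return anoms
```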

Running a large set of hindcasts from which to form the climatologies used to compute anomalies of the initialized predictions is a massive computational burden. Consequently, only occasionally is enough computer time available to perform such a set of simulations. This limits the research that can be done when the goal is to perform case studies of processes, or to experiment with alternate model configurations, resolutions, or initialization methods. Here we first apply two different initialization methods in two different Earth system models (the Energy Exascale Earth System Model version 1, E3SMv1; and the Community Earth System Model version 1, CESM1) with a limited set of start years to compare the model-dependence and initialization-dependence of the drifts. As the verification metric, we quantify the skill of predictions of spatial patterns of multi-year Pacific sea surface temperature (SST) anomalies in the domain of the Interdecadal Pacific Oscillation (IPO), the dominant pattern of decadal timescale SST variability in the Pacific (Power et al. 1999). The IPO can be represented as the second EOF of low pass filtered, non-detrended observed SSTs, with the PC time series from that EOF serving as an IPO index (e.g. Meehl and Arblaster 2011). In its positive phase, the IPO SST anomaly pattern is characterized by positive SST anomalies in the tropical Pacific and negative SST anomalies in the northwest and southwest Pacific, with opposite-sign SST anomalies in the negative phase. The objective of using Pacific region SSTs as the verification metric is to predict elements of this IPO SST pattern. We then use the much larger hindcast data set from the Decadal Prediction Large Ensemble (DPLE) with CESM1, compute the drifted hindcasts to form a model climatology, and calculate anomalies to compare to anomalies formed from the fully-drifted free-running historical simulations. The objective is to determine whether a smaller set of start years with a model climatology computed from the historical simulations is comparable to a much larger set of start years with a model climatology computed from the drifted hindcasts, and whether either makes a difference when using different initialization methodologies.

2 Models, simulations, and initialization methods

The two Earth system models analyzed here are E3SMv1 and CESM1. The components and coupled simulations of E3SMv1 are documented by Golaz et al. (2019). The horizontal resolution of the E3SMv1 configuration analyzed here consists of a 110-km atmosphere, 165-km land, and a 0.5° river model. The MPAS ocean (Petersen et al. 2019) and sea ice models in E3SMv1 have a mesh spacing that varies from 60 km in the mid-latitudes to 30 km at the equator and poles. The ocean has 60 vertical layers on a z-star vertical coordinate.

E3SM was originally branched from CESM but has since evolved with significant developments in all model components. The E3SM atmospheric model (EAMv1) was built upon the CESM atmospheric model (CAM version 5.3), but the two models diverge in details of tuning and of cloud and aerosol formulations, and EAMv1 uses a spectral element dynamical core. The ocean represents probably the biggest difference between E3SMv1 and CESM1 in that E3SMv1 uses an unstructured-mesh ocean model (MPAS-Ocean) while CESM1 includes a conventional dipolar-mesh ocean model (POP2), as discussed below. An overall description of CESM1 is provided by Hurrell et al. (2013). The atmospheric model in CESM1, as in E3SMv1, has a nominal 1° latitude-longitude resolution; note that CAM5.3 uses a finite-volume dynamical core. The ocean model in CESM1 is a version of POP2 (Parallel Ocean Program, version 2), with a nominal 1° horizontal resolution, enhanced meridional resolution in the equatorial tropics, 60 levels in the vertical, and ocean biogeochemistry. While the meridional resolution of the MPAS ocean in E3SMv1 is very similar to that of POP2 in CESM1, the grid cells are much more isotropic in E3SMv1, giving the MPAS ocean a much higher effective resolution than the ocean in CESM1. Other features of CESM1 involving land and sea ice are described in detail by Hurrell et al. (2013).

The predictions in E3SMv1 and CESM1 are initialized on November 1 of the start years listed in Table 1, and each uses two different initialization schemes. The first is the “forced ocean sea ice” (FOSI) method (Yeager et al. 2018). In this method the ocean model is run repeatedly over the historical period (1958–2018) with observed atmospheric forcings. With each forcing cycle, the upper ocean asymptotes closer to the climatological base state such that year-to-year variability increasingly reflects the ocean response to the imposed atmospheric conditions (Tsujino et al. 2020). By the fifth cycle (used for the initial states of the hindcasts), the ocean model evolution is dominated by the observed atmospheric forcing and resembles the observed time evolution of the ocean (Yeager et al. 2018). This method is relatively economical and leaves the ocean in a state that contains some of the model systematic errors but also retains elements of the observed ocean states.

Table 1 Initial years and number of ensemble members for each model and initialization method. Note that CESM1 FOSI is a subset of the CESM1 DPLE. The initial year refers to the start date of Nov. 1 of that year. Lead month 1 is the December of the initial year, and lead year 1 is the first full calendar year after the Nov. 1 start date. The historical large ensembles use 13 members from E3SMv1, and 33 members from CESM1 averaged from 1985–2017

The second initialization scheme is termed the “brute force” (BF) method (Kirtman and Min 2009; Paolino et al. 2012; Infanti and Kirtman 2017). The data source for the initialization is the Climate Forecast System Reanalysis (CFSR; Saha et al. 2010), and in the above references the ocean, land and sea ice are initialized. In the application with E3SMv1 and CESM1 described here, only the ocean component is initialized. Atmosphere, land and sea-ice states are randomly sampled from long climate simulations.

The following is the procedure used to generate the model ocean restart file from the CFSR ocean data assimilation restart file; the same procedure is used for both E3SM and CESM. The two restart files have different horizontal and vertical resolutions, so the CFSR fields are interpolated horizontally and vertically using a bilinear interpolation scheme. Climatological data from long model simulations are used in regions where CFSR data are undefined on the model grids. The surface pressure for the ocean component of the models is estimated from the sea surface height, and the pressure gradient terms are estimated using centered differencing. The salinity initial condition for the model restarts is constructed by adding anomalies from CFSR, rescaled to have the same anomaly standard deviation as CESM1, to the model climatology. This approach has proven successful for NMME and SubX forecasts with CCSM4 at ocean eddy parameterized scales (e.g. Infanti and Kirtman 2017) and at ocean eddy resolving scales (e.g. Infanti and Kirtman 2019; Siqueira et al. 2021).
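A schematic of these conversion steps is sketched below, assuming simple regular latitude-longitude grids; the grid handling, function names, and masking are illustrative placeholders rather than the actual restart-conversion tool chain.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def remap_bilinear(src_lat, src_lon, src_field, dst_lat2d, dst_lon2d):
    """Bilinearly remap a CFSR field to the model grid; points outside the
    source grid come back as NaN to be filled with climatology."""
    interp = RegularGridInterpolator((src_lat, src_lon), src_field,
                                     bounds_error=False, fill_value=np.nan)
    pts = np.column_stack([dst_lat2d.ravel(), dst_lon2d.ravel()])
    return interp(pts).reshape(dst_lat2d.shape)

def fill_with_climatology(field, model_clim):
    """Use long-simulation climatology where the remapped field is undefined."""
    return np.where(np.isfinite(field), field, model_clim)

def salinity_restart(cfsr_anom, cfsr_anom_std, model_anom_std, model_clim):
    """Add CFSR salinity anomalies, rescaled to the model's anomaly standard
    deviation, onto the model climatology."""
    return model_clim + cfsr_anom * (model_anom_std / cfsr_anom_std)
```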

Of these two initialization methods, BF, where reanalyses are interpolated directly to the model grid, is the more economical, and the ocean initial state is, by construction, very nearly the observed reanalysis state. But the initial drifts can be expected to be larger and more rapid as the model quickly leaves the observed state for its systematic error state. FOSI, on the other hand, is more expensive (though still economical compared to a data assimilation method), and the ocean initial state is a compromise between a representation of the observations, imprinted on the ocean by the observed atmospheric forcing, and the ocean systematic error state. Thus, coupling shock and initial drifts can be expected to be smaller than with the BF method. But it is unclear whether either shows clear advantages over the other.

With regard to generating ensembles, the BF ensembles for E3SMv1 and CESM1 pair the CFSR-derived ocean state with atmosphere and land states chosen randomly from the extended free-running coupled simulation, using the full fields from that atmosphere and land. For example, randomly chosen end-of-October states are used for the forecasts initialized at the beginning of November. No perturbations are made to the ocean and ice states. For the FOSI ocean initial states, E3SMv1 and CESM1 both generate different ensemble members through random small perturbations in the atmospheric model (Yeager et al. 2018).

Table 1 shows the initial years and number of ensemble members for the two models and two initialization methods. All cases are run for 5 years. Note that some start years have 5 ensemble members while others have 3; the smaller number of ensemble members in the latter will introduce some additional noise to the results. These will be compared to the much larger number of start years (every year from 1954 to 2017) and ensemble members (40 for each start year) from the CESM1 DPLE (Yeager et al. 2018). The CESM1 FOSI experiments are 5-member, 5-year subsets taken from the DPLE. The CESM1 historical simulations are from the large ensemble (LE) described by Kay et al. (2015); averages are taken from a subset of the LE with an ensemble size of 33. The E3SMv1 historical runs are the standard CMIP6 simulations initialized from different years of the 1850 preindustrial control, with an ensemble size of 13, described by Golaz et al. (2019).

3 Drifts from initialized states

As has been documented previously, models initialized close to some observed state tend to drift away from that initial state, with the drifts large and rapid (e.g. see Fig. 1 in Meehl et al. 2022). To illustrate the drifts in the two models with the two initialization methods, Fig. 1 shows SST differences for lead month 1 (models are initialized on November 1, so lead month 1 is the first December after initialization), and lead years 1 (the first full calendar year after the November 1 initialization time), 3 and 5, for CESM1 BF, E3SMv1 BF, CESM1 FOSI, and E3SMv1 FOSI for all available ensemble members in Table 1. As noted above, since the BF method brings the model initial state into near-perfect agreement with observations at the time of initialization and the FOSI method does not, the SST errors are larger at lead month 1 (top row of Fig. 1) in both models with the FOSI method compared to the BF method. But even in the space of one month, the BF models have drifts on the order of ±0.5 °C. Consequently, root-mean-square (RMS) errors for lead month 1 for CESM1 BF (0.34) and E3SMv1 BF (0.44) are smaller (i.e. closer to the observed state) than for CESM1 FOSI (0.51) and E3SMv1 FOSI (0.95), with E3SMv1 FOSI having the largest lead month 1 errors.
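The RMS values quoted here and in Fig. 1 are cosine-weighted over 60°S-60°N; a minimal sketch of such a calculation, assuming a regular latitude-longitude grid with NaN over land, is:

```python
import numpy as np

def cos_weighted_rmse(drift, lat, lat_bounds=(-60.0, 60.0)):
    """Area-weighted RMS of a (lat, lon) drift field, 60S-60N by default."""
    w = np.cos(np.deg2rad(lat))
    w = np.where((lat >= lat_bounds[0]) & (lat <= lat_bounds[1]), w, 0.0)
    w2d = np.broadcast_to(w[:, None], drift.shape).copy()
    w2d[~np.isfinite(drift)] = 0.0                    # drop land/missing points
    d2 = np.where(np.isfinite(drift), drift**2, 0.0)
    return np.sqrt((w2d * d2).sum() / w2d.sum())
```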

Fig. 1

Ensemble mean SST bias as represented by drifts (°C) from the observations for start month of November for CESM1 BF (first column), E3SMv1 BF (second column), CESM1 FOSI (third column), and E3SMv1 FOSI (fourth column) for lead month 1 (first row, averaged for December monthly mean), lead year 1 (row 2), lead year 3 (row 3), and lead year 5 (row 4). Historical large ensemble systematic errors are shown in the fifth row, using 13 ensemble members from E3SMv1 and 33 from CESM1, averaged from 1985–2017. The cosine weighted 60°S-60°N root mean square error is shown in the upper left corner of each panel

What is apparent from Fig. 1 is that each model and initialization method settles into a preferred error pattern that is already recognizable at lead year 1 (second row of Fig. 1) as being similar to the systematic error pattern in the historical simulations (bottom row of Fig. 1). In fact, three of the four configurations (all except CESM1 BF) have very similar drift patterns (the three rightmost columns in Fig. 1), with negative SST anomalies in the northwest Pacific, subtropical southwest Pacific, tropical North Atlantic, subtropical southern Indian Ocean, and equatorial Pacific, and positive anomalies in the subtropical southeast and northeast Pacific and subtropical eastern Atlantic. The CESM1 BF has some similarities to the other three in the Northern Hemisphere, but with mostly positive SST anomalies across the Southern Hemisphere oceans. Additionally, the CESM1 BF SST biases asymptote more rapidly (by year 3) to the historical systematic errors in the bottom row than the other hindcast sets, particularly the E3SMv1 sets. This could be partially explained by the fact that the POP ocean is more diffusive, as noted earlier, so the CESM1 drift from the observed BF state is more rapid than for E3SMv1 BF or E3SMv1 FOSI. Meanwhile, the biggest difference in drifts between CESM1 BF and CESM1 FOSI lies in the Southern Hemisphere, where CESM1 FOSI drifts less rapidly toward the eventual systematic error state. This is likely due in part to the FOSI initial state already containing an error imprint, placing it farther from the observations (which the BF state closely matches) in the Southern Hemisphere; the BF state can therefore drift more rapidly to the eventual error state than FOSI in CESM1.

Though the drift patterns set up early, the magnitudes take until about lead year 3 to stabilize. This can be seen in the RMS errors, which increase from lower values in lead month 1 as the errors grow, but are very similar in lead years 3 and 5 in all four configurations, with differences all less than 4%. Thus, in Fig. 1, RMS errors in lead years 3 and 5 are 0.86 and 0.83 for CESM1 BF; 0.87 and 0.89 for E3SMv1 BF; both 0.69 for CESM1 FOSI; and 0.89 and 0.90 for E3SMv1 FOSI.

Another way of quantifying how the drifts stabilize in pattern and magnitude is to calculate the drift in lead month 1, lead year 1, and lead year 3 as a percentage of the drift in lead year 5 for the four configurations (Fig. 2). The darker and more extensive the dark red colors in Fig. 2, the closer the drifts are to the eventual lead year 5 values. The globally averaged percentages of drift for ocean grid points are given at the upper left of each panel. For lead year 3, these values range from 74.7% for CESM1 FOSI to 86.2% for CESM1 BF (where a value of 100% would indicate drift equal to the lead year 5 drift). For the latter, in lead year 3 almost all of the global oceans are close to the eventual drifted values of lead year 5, but for the other three configurations there are notable areas where the drift is not yet close to the eventual lead year 5 values. These areas include the equatorial, eastern subtropical, and South Pacific Convergence Zone (SPCZ) areas of the Pacific Ocean. There are some interesting fluctuations of the drift in the two FOSI configurations (the rightmost two columns), where drift percentages are near 80% in the equatorial eastern Pacific in lead year 1 and are then reduced to less than 30% in lead year 3. This suggests that the time-dependence of the drifts in those regions with the FOSI method in both models includes substantial non-monotonicity in the equatorial Pacific, possibly through ocean dynamics related to initialization shock (Teng et al. 2017), with this variation more pronounced in CESM1 FOSI.
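A sketch of this diagnostic, with illustrative array names and the opposite-signed points flagged (plotted white in Fig. 2), might look like:

```python
import numpy as np

def percent_of_year5_drift(drift_lead, drift_y5):
    """Drift at a given lead as a percentage of the lead year 5 drift."""
    same_sign = np.sign(drift_lead) == np.sign(drift_y5)
    with np.errstate(divide="ignore", invalid="ignore"):
        pct = 100.0 * drift_lead / drift_y5
    return np.where(same_sign, pct, np.nan)   # NaN where signs differ (white)
```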

Fig. 2

Fraction of the lead year 5 SST drift, computed as a percentage (the drift at each lead divided by the lead year 5 drift), for CESM1 BF (first column), E3SMv1 BF (second column), CESM1 FOSI (third column), and E3SMv1 FOSI (fourth column) for lead month 1 (first row), lead year 1 (row 2), and lead year 3 (row 3). The average percentage of drift over the total ocean area is at the upper left of each panel. White areas denote values with opposite sign

To further highlight differences in drift between the two initialization methods and two models, Fig. 3 shows where the BF and FOSI drifts have the same sign (blue colors indicate both negative or both positive), while red colors denote drifts of opposite sign. For E3SMv1 in the right column, for both lead year 3 and lead year 5, nearly all ocean areas are the same sign (blue), indicating very similar drift characteristics between the BF and FOSI initialization schemes. However, CESM1 (left column) shows differences of drift sign between the two initialization schemes, mainly in areas of tropical precipitation maxima in the Indian and Pacific Oceans. This was noted in Fig. 1, where the lead year 5 drifts from CESM1 BF (bottom of left column in Fig. 1) show nearly all positive values in the Intertropical Convergence Zone (ITCZ), SPCZ, and tropical Indian Ocean, compared to negative values in CESM1 FOSI (bottom of second column from the right). One possibility for this result is that the hindcast sample is too small to isolate drift, and the differences highlighted in Fig. 3 are due to noise. However, using all 64 start years for the hindcasts from the DPLE for CESM1 shows patterns nearly identical to the CESM1 FOSI drifts computed with the smaller number of start years in the third column of Fig. 1 (not shown). Thus, the drifts are robust even when using a smaller number of start years, so the differences in Fig. 3 are not due to noise. This indicates that there likely is a difference in interannual drift variability (ENSO evolution) that is systematic and likely related to how initialization shock manifests in FOSI vs. BF, which has also been seen in other models (e.g. CanCM4, Fig. 15 in Saurral et al. 2021).
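The Fig. 3 diagnostic reduces to a sign-agreement fraction; a minimal sketch, assuming the drift fields carry a leading sample dimension (start years by ensemble members), is:

```python
import numpy as np

def sign_agreement_fraction(bf_drift, fosi_drift):
    """Fraction of samples in which BF and FOSI drifts share a sign,
    per grid point; inputs have shape (n_sample, nlat, nlon)."""
    agree = np.sign(bf_drift) == np.sign(fosi_drift)
    return agree.mean(axis=0)    # near 1: same sign; near 0: opposite sign
```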

Fig. 3

Comparison of the sign of the drift between the BF and FOSI methods, as the fraction of samples for which BF and FOSI have the same sign (blue = both negative or both positive) or opposite sign (red), for a given model (CESM1, left column; E3SMv1, right column) for lead month 1 (first row), lead year 1 (row 2), lead year 3 (row 3), and lead year 5 (row 4)

4 Computing anomalies to evaluate hindcast skill

Removing drift from initialized hindcasts usually involves computing some sort of drifted climatology from the model, from which anomalies are computed that can be compared to corresponding anomalies from observations over the same periods (Doblas-Reyes et al. 2013; Meehl et al. 2022). However, since the patterns of model drift shown in Fig. 1 are comparable for each model and initialization method, and approach the mean systematic errors of the historical ensemble averages from the two models (e.g. pattern correlations between lead year 5 drifts and historical large ensemble climatological errors are all greater than +0.7), it may be possible to use ensemble mean systematic errors from the uninitialized historical simulations to represent the mean drifts toward that systematic error state in the initialized hindcasts. This would allow the creation of many more samples of model climatology with which to calculate anomalies from the limited set of initial years of the initialized hindcasts.
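In code form, the substitution amounts to referencing the hindcasts to a historical large-ensemble climatology instead of a drifted-hindcast climatology; the sketch below assumes illustrative array shapes and names:

```python
import numpy as np

def hindcast_anoms_vs_historical(hindcast, hist_le, hist_years, clim_years):
    """hindcast: (n_start, n_lead, nlat, nlon) drifted hindcasts;
    hist_le: (n_member, n_year, nlat, nlon) uninitialized historical LE.
    Returns hindcast anomalies relative to the LE climatology."""
    mask = np.isin(hist_years, clim_years)           # e.g. 1985-2017
    hist_clim = hist_le[:, mask].mean(axis=(0, 1))   # ensemble/time mean state
    return hindcast - hist_clim                      # broadcast over start, lead
```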

To illustrate the advantage of a larger number of samples from which to compute a model climatology, Fig. 4 shows anomaly pattern correlations for the IPO region in the Pacific (40°S-70°N, 100°E-80°W), which has been a target for previous initialized hindcasts (e.g. Meehl et al. 2016, 2022). If all of the limited drifted hindcast states are used to compute the model climatology, Fig. 4a shows much scatter among ensemble members and relatively low positive or even negative anomaly pattern correlations for both models and both initialization methods. Using the available initialized hindcasts from the 15 year periods prior to the initial years to compute the model climatology (Fig. 4b), as in Meehl et al. (2022), there is still considerable noise among ensemble members and some large negative anomaly pattern correlations. In Fig. 4b, given the small number of initial years, the 1990 set is dedrifted using only the 1985 set, and the 1995 set uses 1985 and 1990; the first year plotted is therefore 1994, representing start year 1990, which introduces additional noise for those first few initial years. With only the available drifted hindcasts for the climatology (Table 1), there are so few samples that their average can be far from the average of drifted hindcasts from a much larger ensemble. This produces bigger anomalies, some far from the observations, and hence greater noise in the correlations. Even with the same number of samples taken from the LEs as noted in Table 1, the LE members represent a more equilibrated drifted state and the errors are better sampled, so the ensemble average is closer to the actual drift of the 13 member (E3SMv1) or 33 member (CESM1) LEs. If all 33 members from the CESM1 LE are used, as in Fig. 5 discussed below, there are many more samples, the errors are better sampled still, and the anomalies are smaller and closer to the observations, with better skill and reduced noise.

Accordingly, if the ensemble averages from the respective model historical uninitialized simulations are used to compute the model climatology (Fig. 4c), noise among the ensemble members is reduced, as could be expected from the larger number of samples in the historical ensemble averages compared to Fig. 4a. However, there is evidence of a possible trend being introduced into the climatology (as noted by Meehl et al. 2022) and hence into the anomaly pattern correlations, with mostly negative values in the early period and positive values in the later period. Additionally, computing anomalies by subtracting the uninitialized climatology (bottom row of Fig. 1) will tend to produce negative IPO-like patterns, since the hindcast climatology in years 3–5 is generally less positive-IPO-like than the uninitialized climatology (Fig. 1). Even where drift pattern correlations with the uninitialized errors are fairly high, the amplitude in years 3–5 is considerably lower than in the uninitialized climatology (except for CESM1 BF). Consequently, the negative-IPO anomalies will show better pattern correlation skill after about 2000, when the observations switched to a negative IPO.
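The anomaly pattern correlation itself can be sketched as follows, here in an area-weighted form over the IPO domain with an optional centered variant; inputs and names are illustrative:

```python
import numpy as np

def pattern_correlation(model_anom, obs_anom, lat, centered=False):
    """Spatial correlation of two (lat, lon) anomaly maps with cosine
    latitude weighting; NaNs (e.g. land) are excluded via zero weights."""
    w = np.broadcast_to(np.cos(np.deg2rad(lat))[:, None], model_anom.shape).copy()
    valid = np.isfinite(model_anom) & np.isfinite(obs_anom)
    w[~valid] = 0.0
    x = np.where(valid, model_anom, 0.0)
    y = np.where(valid, obs_anom, 0.0)
    if centered:                      # remove weighted spatial means
        x = x - (w * x).sum() / w.sum()
        y = y - (w * y).sum() / w.sum()
    num = (w * x * y).sum()
    den = np.sqrt((w * x**2).sum() * (w * y**2).sum())
    return num / den
```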

Fig. 4

Anomaly pattern correlations of SST for the IPO region of the Pacific (40°S – 70°N; 100°E – 80°W) for the limited set of initial years (Table 1) and 3–5 year average leads. Reference climatology used to calculate anomalies to compare to observations is (a) average of drifted initialized hindcasts for those start years and comparable periods from observations, (b) same as (a) except using drifted hindcasts in the 15 years prior to the initial years using the smaller number of CESM1 DPLE hindcasts according to Table 1, (c) same as (a) except using the climatology from the CESM1 and E3SMv1 historical large ensembles, (d) same as (a) except for climatology from previous 15 years of the CESM1 and E3SMv1 historical large ensembles. Solid lines are ensemble averages (model key at upper left in panel c), and thin lines are individual ensemble members. Horizontal dashed line is a nominal reference value for anomaly pattern correlation of 0.5. The year labels on the x axis refer to lead years 3–5 of each hindcast that are centered on year 4, so, for example, 1989 on the x axis signifies the hindcast initialized in November 1985

Fig. 5

Anomaly pattern correlations for the IPO region of the Pacific (40°S – 70°N; 100°E – 80°W) for the full set of 64 initial years (1954–2017) from the CESM1 DPLE. The reference climatology uses the 15 years prior to the initial years to calculate anomalies to compare to observations, so the first year plotted is 1954 + 15, or 1969. (a) Average 3–5 year leads using climatology from the drifted initialized hindcasts; (b) same as (a) except for 3–7 year leads; (c) 3–5 year average leads using climatology from the large ensemble; (d) same as (c) except for 3–7 year average leads. Dots on panels a and c are taken from Fig. 4b and d, respectively, for the smaller number of start years and ensemble members. Thin red lines are individual ensemble members, the bold red line is the pattern correlation for the ensemble mean, and the mean and standard deviation are at the upper left of each panel. Horizontal dashed line is a nominal reference value for anomaly pattern correlation of 0.5

If the previous 15 year periods are used to compute the model climatology from the uninitialized historical simulations (Fig. 4d), the possible effect of a trend in the historical climatology evident in Fig. 4c is reduced, as documented by Meehl et al. (2022), because the previous 15 years have a much smaller trend than the entire historical period. The spread among ensemble members is also reduced compared to the corresponding calculation using the small number of drifted hindcasts in Fig. 4b, and the hindcast periods that are not negatively affected by volcanic eruptions (which reduce hindcast skill after Pinatubo, and through the cumulative effects of the Tavurvur and Nabro eruptions in the early 2000s; e.g. Santer et al. 2014; Meehl et al. 2015; Wu et al. 2023) show anomaly pattern correlations near or above 0.5. The drop in skill for years after 2020 is likely related to the multi-year La Niña event that started in 2020, which could have been externally forced by smoke from the catastrophic large scale wildfires in Australia in 2019–2020 (Fasullo et al. 2021, 2023). None of the hindcasts considered here, including the DPLE, included the Australian wildfire smoke; if the multi-year La Niña event was at least partly externally forced, the verification observations would include the effects of the smoke, thus reducing the hindcast skill of model simulations that did not include it.

In any case, comparing the four sets of results in Fig. 4, the method that appears to exhibit the greatest skill and least noise among ensemble members uses the previous 15 year method from the ensemble average of the uninitialized historical simulations from the two models in Fig. 4d.

To provide greater context for evaluating the results in Fig. 4, Fig. 5 shows similar anomaly pattern correlations for the Pacific region SSTs for the larger number of initialized hindcasts in the DPLE (using CESM1 and the full 40 members from initial years overlapping those in the limited hindcast set; recall that CESM1 FOSI used only 5 ensemble members from the DPLE). Because previous work used average 3–7 year leads (e.g. Meehl et al. 2016, 2022), and recalling that the limited initial year model simulations only extend to lead year 5, the somewhat longer 3–7 year leads are included along with the 3–5 year leads to judge how a longer prediction average relates to the lead year 3–5 average analyzed earlier. Values from each initial year and each ensemble member are plotted in Fig. 5 as the thin red lines. The thick red line is the pattern correlation of the 40-member ensemble mean. Note that the ensemble mean will not necessarily fall in the middle of the spread of individual ensemble members, since the correlation is not a linear operation; because ensemble averaging damps unpredictable noise, the ensemble mean is typically closer to observations and thus tends to have a higher correlation than most individual members. The dots in Fig. 5a, c are the same as those in Fig. 4b, d for the limited set of start years in both models and both initialization schemes, allowing direct comparison with the larger number of start years and ensemble members from the DPLE.
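A purely synthetic toy example illustrates this point: averaging members damps uncorrelated noise, so the ensemble-mean pattern can correlate with the "observations" better than essentially any single member. All numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.standard_normal(500)                        # stand-in "observed" pattern
members = obs + 2.0 * rng.standard_normal((40, 500))  # common signal + large noise

r_members = [np.corrcoef(m, obs)[0, 1] for m in members]
r_mean = np.corrcoef(members.mean(axis=0), obs)[0, 1]

print(f"average member correlation: {np.mean(r_members):.2f}")  # roughly 0.45
print(f"ensemble-mean correlation:  {r_mean:.2f}")              # roughly 0.95
```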

Figure 5a, b shows anomaly pattern correlation values for 3–5 year leads and 3–7 year leads, respectively, using the ensemble average of the drifted hindcasts from the CESM1 DPLE. As expected, the average over the shorter forecast period of 3–5 year leads in Fig. 5a produces lower pattern correlations on average (0.25 in Fig. 5a, 0.33 in Fig. 5b) and greater noise (average standard deviation of 0.29 in Fig. 5a, compared to 0.25 for the 3–7 year leads in Fig. 5b). Both show a reduction of skill due to the Pinatubo and Tavurvur/Nabro volcanic eruptions, as noted in earlier studies (e.g. Meehl et al. 2015; Wu et al. 2023). Figure 5c shows the average 3–5 year lead anomaly pattern correlations that can be compared to the average 3–7 year lead time series in Fig. 5d when the large ensemble is used to compute the model climatologies. As in the first two panels, the shorter forecast averaging period in Fig. 5c has lower average pattern correlation values than the 3–7 lead year average in Fig. 5d (0.28 compared to 0.35), and more noise on average (standard deviation of 0.27 in Fig. 5c compared to 0.23 in Fig. 5d). Comparing the two 3–7 year lead averages in Fig. 5b and d, use of the historical average from the large ensemble rather than the drifted hindcasts produces a somewhat larger average pattern correlation value (+0.35 in Fig. 5d compared to +0.33 in Fig. 5b). This indicates that, for the DPLE, using the historical large ensemble rather than the drifted hindcasts produces somewhat more skillful results for the Pacific region.

The 3–5 year leads from the DPLE in Fig. 5a, c can be compared to the dots for the reduced set of initial years from CESM1 FOSI taken from Fig. 4b and d. The overall higher anomaly pattern correlation values noted for the previous 15 year historical climatology in Fig. 4d for CESM1 FOSI, compared to the initialized hindcast climatology in Fig. 4b, are reflected in similar results for the larger set of hindcasts and ensemble members from the DPLE in Fig. 5a, c. That is, the 3–5 year leads using the historical large ensemble average to compute the model climatology for the previous 15 years are less noisy and have overall larger positive anomaly pattern correlations in Fig. 5c than in Fig. 5a, which uses the drifted hindcasts as the reference climatology. The consistency between the larger set of initial years and ensemble members from the DPLE in Fig. 5c and the smaller set of start years with fewer ensemble members (dots in Fig. 5c) supports the notion that using the average of the historical large ensemble over the previous 15 years as the reference model climatology presents a viable option for assessing the skill of IPO predictions in the Pacific for a small set of start years.

Another way to check for consistency in the use of the uninitialized historical large ensemble compared to the initialized drifted hindcasts is shown in Fig. 6 where the time series anomaly correlation coefficients at each grid point are calculated from the DPLE. The drifted hindcasts as model climatology (top row) and the uninitialized historical large ensemble as model climatology (bottom row) are shown for lead years 3–5 (first and third columns) and lead years 3–7 (second and fourth columns) for a climatology computed using the 15 years prior to the initial year (left two columns, panels a,b,c,d) and the total available model climatology (right two columns, panels e,f,g,h). Darker red colors indicate greater forecast skill.

Fig. 6

Anomaly correlation time series for full 64 initial years from the DPLE for 1954 to 2017, top row using the drifted hindcasts as the model climatology for previous 15 years (panel a for lead years 3–5, panel b for lead years 3–7), and for the full drifted hindcasts for the entire DPLE period (panel e for lead years 3–5, panel f for lead years 3–7); bottom row using the historical uninitialized large ensemble from CESM1 as the model climatology for previous 15 years (panel c for lead years 3–5, panel d for lead years 3–7), and for the historical uninitialized large ensemble from CESM1 for the entire DPLE period (panel g for lead years 3–5, panel h for lead years 3–7); numbers at top left of each panel are the average anomaly correlation coefficients over the entire domain plotted

First, in agreement with previous results (e.g. Yeager et al. 2018), there is greater skill in the western tropical Pacific (anomaly correlation coefficients greater than +0.8) and less skill in the eastern tropical Pacific (anomaly correlation coefficients less than +0.3) for all methods and all lead years. For the Pacific basin as a whole, the average correlations over the domain (top left corner of each panel) indicate somewhat better skill at lead years 3–5 and 3–7 using the total model climatology (Fig. 6e, f) compared to the previous 15 year climatology (Fig. 6a, b), with average anomaly correlation coefficients of +0.45 and +0.50 for the former, compared to +0.36 and +0.41 for the latter. As noted earlier, this is because the externally-forced trend contributes skill when anomalies are computed relative to the long-term climatology, but this trend-related skill is considerably reduced when using the previous 15 year method (Meehl et al. 2022). The bottom row shows that there is actually greater overall hindcast skill using the historical ensemble averages over the previous 15 years as climatology than using the entire historical period as climatology (average anomaly correlation coefficients of +0.39 and +0.43 for the former, and +0.35 and +0.39 for the latter). This is an indication that the trend in the historical large ensemble is different from the trend in the initialized hindcasts of the DPLE (Meehl et al. 2022), and the two different trends reduce overall skill. Of note, however, is that the overall anomaly correlation coefficients using the historical large ensemble simulations and the previous 15 years as climatology (Fig. 6c, d; +0.39 and +0.43) are comparable to the same anomaly calculation using the drifted initialized hindcasts (Fig. 6a, b; +0.36 and +0.41). This indicates that if the 15 years prior to the initial years are used as the reference climatology, using the historical large ensemble to compute the model climatology yields skill comparable to using the drifted hindcasts as climatology for the Pacific Ocean region.
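The grid-point diagnostic in Fig. 6 is a temporal anomaly correlation across start years; a minimal sketch with illustrative shapes (one value per start year at each grid point for a given lead average), written here in centered form (the text notes centered and uncentered give similar results), is:

```python
import numpy as np

def gridpoint_acc(hind_anom, obs_anom):
    """Temporal anomaly correlation at each grid point.
    hind_anom, obs_anom: (n_start, nlat, nlon) -> (nlat, nlon) map."""
    hx = hind_anom - hind_anom.mean(axis=0)
    ox = obs_anom - obs_anom.mean(axis=0)
    num = (hx * ox).sum(axis=0)
    den = np.sqrt((hx**2).sum(axis=0) * (ox**2).sum(axis=0))
    return num / den
```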

5 Discussion

The fact that there are differences in drift between initialization schemes in CESM1 but not in E3SMv1 points to differences between the two models. First, as noted by Golaz et al. (2019), the POP2 ocean in CESM1 is more diffusive than the MPAS ocean in E3SMv1 because of various model development choices and details of the numerics. This could partly explain why the CESM1 drift from the observed BF state is more rapid than for E3SMv1 BF or E3SMv1 FOSI. Second, POP2 in CESM1 has a lower effective resolution in the tropics than the MPAS ocean in E3SMv1, and the latter has a very robust simulation of tropical instability wave (TIW) activity. Additionally, POP2 in CESM1 has a larger Redi mixing coefficient than the MPAS ocean in E3SMv1, which has implications for the ENSO amplitude in CESM1 being about twice that in E3SMv1 (Golaz et al. 2019), since a reduced Redi coefficient can decrease ENSO amplitude (Gnanadesikan et al. 2017). In addition, the anemic AMOC in E3SMv1 (about 11 Sv at 26.5°N below 500-m depth; Golaz et al. 2019) compared to CESM1 (about 26 Sv at 26.5°N below 500-m depth; Danabasoglu et al. 2020) drives a cold SST bias in the North Atlantic in that model, since the weaker AMOC transports less heat northward (Golaz et al. 2019). This bias pattern sets up quickly and robustly, regardless of initialization, as seen in the E3SMv1 results in Fig. 1. BF seems to enhance the AMOC in CESM1, whereas FOSI tends to lead to AMOC weakening.

There is evidence presented here that anomalies computed using the limited start years and the uninitialized climatology can adequately represent the patterns of drifted model states after about lead year 3. Compared to the CESM1 DPLE using the conventional methodology with the large hindcast data sets, there is comparable skill in CESM1 for predicting Pacific region SST patterns indicative of the Interdecadal Pacific Oscillation with this method, using the anomaly pattern correlation metric. This works in part because, even if anomalies relative to the historical large ensemble have a uniform offset compared to what they would have relative to the full DPLE, the skill metrics examined here (uncentered spatial or temporal correlation; centered produces similar results) are insensitive to that mean offset (i.e. the correlations do not depend on temporal or spatial mean values). This method would be less effective when looking at individual forecast anomaly maps or at other skill metrics, such as the mean square skill score (MSSS).
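As a small sanity check of the offset argument, a centered correlation is exactly invariant to adding a uniform constant to one field (and, per the argument above, the metrics used here behave similarly for these near-zero-mean anomaly fields); the demonstration below uses purely synthetic data.

```python
import numpy as np

rng = np.random.default_rng(1)
model = rng.standard_normal(1000)             # synthetic model anomaly pattern
obs = model + 0.5 * rng.standard_normal(1000)

def centered_corr(a, b):
    return np.corrcoef(a, b)[0, 1]            # np.corrcoef removes the means

print(np.isclose(centered_corr(model, obs),
                 centered_corr(model + 0.7, obs)))   # True: offset drops out
```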

6 Conclusions

A set of hindcasts for a limited number of start years is run with CESM1 and E3SMv1 using two different initialization methods to assess whether uninitialized free-running historical simulations can be used to form the model climatologies from which anomalies are computed to evaluate prediction skill. After a few years, the drifted initialized models approach the uninitialized model climatological errors in both magnitude and pattern, such that after about lead year 3 anomalies from the limited start years can use the uninitialized climatology to represent the patterns of drifted model states in the Pacific region. Compared to the conventional methodology with large hindcast data sets, there is comparable skill for predicting the pattern of Pacific region SSTs representative of the Interdecadal Pacific Oscillation using this method. With regard to the two initialization methods, the drifts differ somewhat but are so large that by about lead year 3 the two methods are roughly comparable, though there is some model dependence to the drifts, especially in the equatorial Pacific region in CESM1.

We conclude that initialized hindcasts with a limited number of start years can be evaluated after about lead year 3 using the historical large ensemble as the reference climatology, provided the 15 years prior to the initial years are used to compute that climatology. Thus, for these models and for the IPO in the Pacific, indications are that it is likely not necessary to run a complete hindcast data set for every new decadal climate prediction experiment. Additionally, after about lead year 3 the differences between the two initialization methods are minimal in most regions, though there is model dependence in the agreement between the two methods.