1 Introduction

Initialised decadal predictions are commonly used on time scales of up to ten lead years and have shown considerable success in the past (Meehl et al. 2009; Merryfield et al. 2020). Initialised model predictions beyond 10 years are only rarely created and analysed. Pohlmann et al. (2004) and Pohlmann et al. (2006) presented idealised experiments on the predictability within a climate model and of the Atlantic Meridional Overturning Circulation (AMOC) for up to twenty lead years. During the Coupled Model Intercomparison Project phase 5 (CMIP5), initialised near-term experiments for up to 30 lead years were performed, but only for three initialisation years (Meehl and Teng 2012; Taylor et al. 2012; Meehl et al. 2014). A comprehensive analysis of decadal predictions beyond a 10-year prediction horizon is currently missing, despite efforts to design and operate simulations aiming at this time scale (Delworth et al. 2020).

During CMIP5, initialised decadal prediction models demonstrated considerable skill improvements compared with uninitialised historical model runs, which depend only on the natural and anthropogenic forcing. One area of identified additional skill was the North Atlantic (Marotzke et al. 2016). In CMIP phase 6 (CMIP6) this improvement in predictability from uninitialised to initialised simulations was much less pronounced, as the skill increase in uninitialised simulations outweighed the skill increase in initialised simulations (Borchert et al. 2021). Alternatives to initialised prediction systems have been developed, either combining initialised predictions with uninitialised historical simulations through post-processing steps that bridge the gap between the two kinds of simulation (Befort et al. 2022), or exploring the use of model analogues from the CMIP6 database (Menary et al. 2021). Reliable predictions in the window of ten to twenty years would benefit stakeholders for long-term planning purposes.

This study presents predictions of the high-impact variable 2-metre air temperature with lead times of up to 20 years, created with a decadal prediction system (Polkova et al. 2019; Hövel et al. 2022) based on the CMIP6 version of the Max Planck Institute Earth system model (MPI-ESM, Mauritsen et al. 2019). We compare these predictions to uninitialised historical and scenario simulations from the recently released Max Planck Institute grand ensemble (MPI-GE) CMIP6 (Olonscheck et al. 2023). By analysing the effect of initialisation on the AMOC we show the long-term effect initialisation might have.

With our analysis we investigate whether initialised decadal predictions are still of value within or after CMIP6, or whether uninitialised historical and scenario runs are sufficient to predict the future of our climate. We analyse how the initialisation impacts the simulation, and if and when the skill of the initialised predictions converges to that of the uninitialised simulation. Based on our analysis, we discuss how long-term prediction systems can be designed and validated to disentangle the impact of external forcing and initialisation on the quality of those systems.

2 Data

We investigate initialised and uninitialised simulations with MPI-ESM version 1.2 (Mauritsen et al. 2019), in low-resolution (LR). The atmosphere is resolved horizontally with 1.875\(^\circ\) and vertically with 47 levels between surface and 0.1 hPa (Stevens et al. 2013). The ocean is resolved horizontally with a nominal resolution of 1.5\(^\circ\) and vertically with 40 levels (Jungclaus et al. 2013).

Both the initialised and uninitialised simulations are forced by the same external radiative boundary conditions, which correspond to the historical CMIP6 forcing until 2014 and the SSP2-4.5 scenario starting in 2015 (Eyring et al. 2016). The 20-year-long initialised simulations are an extension of the 16-member MPI-ESM-LR decadal prediction system described in Hövel et al. (2022), from which the members of each 10-year retrospective forecast (henceforth called hindcast), started every November from 1960 to 2019, have been prolonged to cover 20 years. The initialisation and ensemble generation scheme is based on a 16-member MPI-ESM-LR assimilation (1958–2020) of observed oceanic and atmospheric states into the model (Brune and Baehr 2020), using an oceanic Ensemble Kalman filter with an implementation of the Parallel Data Assimilation Framework (Nerger and Hiller 2013) and atmospheric nudging. For the uninitialised simulations we use the first 16 members of the CMIP6 MPI-GE set of simulations from Olonscheck et al. (2023). All model simulations share the same MPI-ESM model version.

In our analysis we mainly focus on 2-m air temperature and its associated global large-scale prediction patterns. All results are therefore presented as means over 10 degree by 10 degree boxes. We measure the prediction skill of 2-m air temperature with the anomaly correlation coefficient (ACC) of three-lead-year means with reference to the assimilation, with the central year chosen between 1979 and 2018, the time period common to all lead years. The model-consistent assimilation simulation is chosen as reference to prevent potential model inconsistencies from influencing the results. We analyse the impact of initialisation as the difference in ACC between initialised and uninitialised simulations. Significance is estimated by bootstrapping with 500 iterations at the 5% significance level. In addition, we put a second focus on the representation of the Atlantic meridional overturning circulation (AMOC), because the AMOC is known to be sensitive to initialisation, with the potential to carry the initialisation signal over at least a decade (Brune and Baehr 2020).
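The skill measure described above can be illustrated with a minimal sketch. It assumes that the hindcast values of one 10 by 10 degree box are stored as an array of shape (verification year, lead year), already aligned to the central verification year, and that anomalies are taken relative to the mean over the verification period. The function names `lead_year_mean` and `acc` and the synthetic data are hypothetical and serve illustration only; they are not taken from the prediction system itself.

```python
import numpy as np


def lead_year_mean(hindcast, first_lead, last_lead):
    """Average a (n_year, n_lead) array over lead years first_lead..last_lead.

    Lead years are 1-based and inclusive, so (1, 3) selects lead years 1 to 3.
    """
    hindcast = np.asarray(hindcast, dtype=float)
    return hindcast[:, first_lead - 1:last_lead].mean(axis=1)


def acc(forecast, reference):
    """Anomaly correlation coefficient of two equally long time series.

    Assumption: anomalies are taken relative to each series' own mean
    over the verification period.
    """
    f = np.asarray(forecast, dtype=float)
    r = np.asarray(reference, dtype=float)
    fa, ra = f - f.mean(), r - r.mean()
    return float(np.sum(fa * ra) / np.sqrt(np.sum(fa**2) * np.sum(ra**2)))


# Synthetic illustration: 40 verification years (e.g. 1979 to 2018), 20 lead years.
rng = np.random.default_rng(0)
truth = np.cumsum(rng.normal(size=40))                  # stand-in for the assimilation
hindcast = truth[:, None] + rng.normal(size=(40, 20))   # noisy stand-in "predictions"
skill_1_3 = acc(lead_year_mean(hindcast, 1, 3), truth)  # ACC for lead years 1 to 3
```

Applying the same two functions to every box and every three-lead-year window would yield maps of the kind discussed in Sect. 3.1.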

3 Results

In a first step we look at the global prediction skill of the hindcasts and investigate how much of it can be explained by the correct natural and anthropogenic forcing. In a second step we analyse specific regions to identify the reasons for the results.

3.1 Multi-annual predictions up to 20 lead years

On a global scale, the prediction skill of the hindcasts (Fig. 1) shows an incoherent picture depending on lead time. In the first three-year period (lead years 1 to 3) most of the world is significantly predictable, except for some regions in the Southern Ocean. The prediction maintains significant skill through all lead years up to 20 for large parts of the Northern Hemisphere as well as the southern subtropics. However, a substantial decrease in prediction skill from lead years 3 to 5 onward can be noted in the Northeast Pacific, the tropical Pacific, the Amazon basin, India, and parts of the Southern Ocean. Prediction skill may also re-emerge, e.g. in lead years 14 to 16 in parts of the Southwestern Atlantic, the South Pacific, and parts of the Southern Ocean.

Fig. 1

ACC of hindcast ensemble mean and assimilation simulation for 10 by 10 degree means of 2-m air temperature. ACC values significant at the 5% significance level are dotted. Shown are 3-year mean ACC for a 1 to 3 years; b 3 to 5 years; c 6 to 8 years; d 9 to 11 years; e 14 to 16 years and f 18 to 20 years

To estimate the effects of initialisation on the prediction skill we subtract the skill of the uninitialised simulation from the hindcast skill. In the first decade after initialisation (Fig. 2) the hindcasts show higher skill in lead years 1 to 3, especially in the tropics. By lead years 9 to 11, however, the positive initialisation impact on prediction skill is only partly maintained, e.g. over the Northeast Pacific, the North Atlantic sub-polar gyre, Siberia, the Southern Indian Ocean, and the South Pacific.
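The skill difference and its bootstrapped uncertainty can be sketched as follows. The text specifies 500 iterations and a 2.5 to 97.5 percentile range, but not which dimension is resampled; this sketch assumes that verification years are resampled with replacement, which is only one of several possible choices (resampling ensemble members would be another). The function name `bootstrap_acc_difference` and all inputs are hypothetical.

```python
import numpy as np


def acc(a, b):
    """Anomaly correlation coefficient of two equally long time series."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2)))


def bootstrap_acc_difference(init, uninit, reference, n_iter=500, seed=0):
    """ACC(init) - ACC(uninit) with a 2.5 to 97.5 percentile bootstrap range.

    Assumption: verification years are resampled with replacement, and the
    same resampled years are used for all three series in each iteration.
    """
    init = np.asarray(init, dtype=float)
    uninit = np.asarray(uninit, dtype=float)
    reference = np.asarray(reference, dtype=float)
    rng = np.random.default_rng(seed)
    n = reference.size
    diffs = np.empty(n_iter)
    for k in range(n_iter):
        idx = rng.integers(0, n, size=n)  # resampled verification years
        diffs[k] = acc(init[idx], reference[idx]) - acc(uninit[idx], reference[idx])
    low, high = np.percentile(diffs, [2.5, 97.5])
    best = acc(init, reference) - acc(uninit, reference)
    # The difference counts as significant if the range excludes zero.
    significant = bool(low > 0.0 or high < 0.0)
    return best, float(low), float(high), significant
```

A box would be dotted in maps such as Fig. 2 when `significant` is true, and the returned range corresponds to bracketed values of the form [low; high].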

Fig. 2

Difference of ACC of hindcast ensemble mean and ACC of uninitialised simulation ensemble mean for 10 by 10 degree means of 2-m air temperature. ACC differences significant at the 5% significance level are dotted. Shown are 3-year mean ACC for a 1 to 3 years; b 3 to 5 years; c 6 to 8 years; d 9 to 11 years

For the second decade (Fig. 3), three regions with a distinct impact of initialisation on the prediction skill (difference between hindcast skill and skill of uninitialised simulations) emerge: the Northeast Pacific with a positive impact over all lead years, the North Atlantic sub-polar gyre with a decreasing then re-emerging positive impact, and the Atlantic sector of the Southern Ocean with an intensified negative impact.

Fig. 3

Difference of ACC of hindcast ensemble mean and ACC of uninitialised simulation ensemble mean for 10 by 10 degree means of 2-m air temperature. ACC differences significant at the 5% significance level are dotted. Shown are 3-year mean ACC for a 12 to 14 years; b 14 to 16 years; c 16 to 18 years; d 18 to 20 years. Boxes show the areas investigated more closely in Sect. 3.2

In the following we take a closer look at three specific areas in Figs. 2 and 3. We select three different cases to demonstrate that the effect of initialisation can take different trajectories over lead time within our simulations: the North Atlantic sub-polar gyre, the Atlantic sector of the Southern Ocean, and the Northeast Pacific off the west coast of America (for the extent see the boxes in Fig. 3d).

3.2 Selected areas

3.2.1 North Atlantic sub-polar gyre

In Fig. 4 we investigate the development of skill over lead time in the North Atlantic, where initialisation in prior versions of the MPI-ESM exhibited prediction skill (Marotzke et al. 2016). We see in Fig. 4a that the prediction skill with the CMIP6 model and forcing versions is high across lead years in the hindcasts as well as in the historical runs. Nevertheless, additional skill due to initialisation can only be observed in the first time periods (up to 0.126 [0.041; 0.254]) and is significant only for the earliest time period. Afterwards the hindcast skill moves slightly below that of the uninitialised simulations, before it slightly rebounds again.

In Fig. 4b we compare the mean-drift-corrected hindcast ensemble mean for lead years 18 to 20 with the uninitialised simulation. The changes in the North Atlantic are well represented by runs including the correct forcing. Both simulations show an increase up to around 2007 and stabilise afterwards, whereas the assimilation simulation actually decreases.
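The details of the mean drift correction are not spelled out here. A common approach in decadal prediction, shown purely as an assumption and not necessarily the procedure used for these hindcasts, is to subtract the lead-time-dependent mean bias relative to a reference. The function name `remove_mean_drift`, the array layout and the synthetic data are hypothetical.

```python
import numpy as np


def remove_mean_drift(hindcast, reference):
    """Subtract the lead-time-dependent mean bias from a hindcast set.

    hindcast, reference: arrays of shape (n_start, n_lead), where
    reference[s, l] is the reference value valid at the same time as
    hindcast[s, l]. Assumption: the drift is estimated as the mean
    difference over all start dates, separately for each lead year.
    """
    hindcast = np.asarray(hindcast, dtype=float)
    reference = np.asarray(reference, dtype=float)
    drift = np.mean(hindcast - reference, axis=0)  # one value per lead year
    return hindcast - drift[None, :]


# Synthetic check: a bias growing linearly with lead time is removed.
rng = np.random.default_rng(0)
reference = rng.normal(size=(40, 20))
bias = 0.05 * np.arange(1, 21)
corrected = remove_mean_drift(reference + bias[None, :], reference)
```

Such a correction removes only the mean drift per lead year; it leaves the variability and any start-date-dependent evolution of the hindcasts untouched.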

Fig. 4

Analysis of the selected area in the North Atlantic. a Lead year plot with hindcast ensemble mean (red), uninitialised simulation ensemble mean (blue), difference of hindcast and uninitialised simulation (dark green) and 2.5 and 97.5 percentile of the bootstrapped uncertainty of the difference (light green). b Mean lead year 18 to 20 time series of hindcast ensemble mean (red), uninitialised simulation mean (blue) and assimilation ensemble mean (black) for the time period 1979 to 2018, with lead year 19 as the central value of the covered lead year time span

3.2.2 Atlantic sector of the Southern Ocean

Figure 5a shows the analysis of the region in the Southern Atlantic, where the impact of initialisation on prediction skill turns negative over the lead time horizon. It shows a sharp decrease of prediction skill after the first years (starting at 0.154 [\(-\) 0.108; 0.463]), before it stabilises for multiple years (at around 0.30). Only then does the difference of the hindcast skill compared to the uninitialised simulation skill become significantly negative, before it stabilises again (at around \(-\) 0.90). This demonstrates that initialisation has effects at lead times longer than ten years.

We can see in Fig. 5b that, for lead years 18 to 20 and especially in the last ten years from 2010 onward, the hindcast predicts a decline, while the assimilation run shows a stabilisation at a higher level. The strength of the uninitialised simulation is its continuous increase, which roughly fits the assimilation run. One explanation is that a signal from the initialisation is still present in the hindcast simulation, while the historical simulation follows the path determined by the external forcing. A clear mismatch can be found around 2003, where both hindcast and uninitialised simulation predict an increase, while the assimilation run clearly dips. Here it can be concluded that such dynamical changes are not predictable on these time scales by the hindcast simulation and are not determined by the external forcing.

Fig. 5

Analysis of the selected area in the South Atlantic. a Lead year plot with hindcast ensemble mean (red), uninitialised simulation ensemble mean (blue), difference of hindcast and uninitialised simulation (dark green) and 2.5 and 97.5 percentile of the bootstrapped uncertainty of the difference (light green). b Mean lead year 18 to 20 time series of hindcast ensemble mean (red), uninitialised simulation mean (blue) and assimilation ensemble mean (black) for the time period 1979 to 2018, with lead year 19 as the central value of the covered lead year time span

3.2.3 Northeast Pacific

Wiegand et al. (2018) have already demonstrated that in the Northern Pacific trends in the Pacific Decadal Oscillation are predictable up to lead times of 10 years, without large variability in the prediction skill over time. In Fig. 6a we see that in the Northeast Pacific the hindcast skill is consistently and significantly higher than the low skill of the uninitialised simulation (at around 0.20 increasing to 0.35). When we analyse the prediction on the long time scale itself (Fig. 6b) we see that both the hindcast and the historical simulation show an increase in skill. But while after 2010 the hindcast skill is still increasing constantly, the skill of the uninitialised simulation is slightly decreasing. The large increase in the later part of the assimilation run after 2015 therefore contributes to the success of the hindcast simulation. Other periods of divergence can be found in the 1990s. In this time period, the assimilation run exhibits a positive anomaly, while both simulations show a negative anomaly. This could be partly related to the difference in impact of the 1991 Pinatubo volcanic eruption on decadal climate simulations and observations in the tropical Pacific (e.g., Timmreck et al. 2015; Wu et al. 2023). Nevertheless, the decrease in values is smaller for the hindcast simulations than for the uninitialised one. This is repeated with opposite signs in the early 2000s. We note that a similarly consistent improvement in predictability by initialisation can be observed in the Indian Ocean.

Fig. 6

Analysis of the selected area in the Northeast Pacific. a Lead year plot with hindcast ensemble mean (red), uninitialised simulation ensemble mean (blue), difference of hindcast and uninitialised simulation (dark green) and 2.5 and 97.5 percentile of the bootstrapped uncertainty of the difference (light green). b Mean lead year 18 to 20 time series of hindcast ensemble mean (red), uninitialised simulation mean (blue) and assimilation ensemble mean (black) for the time period 1979 to 2018, with lead year 19 as the central value of the covered lead year time span

3.3 Atlantic meridional overturning circulation

The Atlantic Meridional Overturning Circulation (AMOC) is a main driver behind the decadal evolution of oceanic and atmospheric temperature over the Atlantic, and potentially carries memory in the climate system on decadal timescales and possibly longer (e.g., Delworth et al. 1993; Latif et al. 2006; Yeager and Robson 2017). We therefore complement our study with an analysis of the 20-year impact of the initialisation on the representation of the full AMOC cell, as well as on the AMOC at 26\(^\circ\)N, where observations have existed since 2004 (Moat et al. 2020).

In our hindcasts, the general pattern of the AMOC cell remains almost unchanged for all years after initialisation (Fig. 7). The initialisation lays down the prevailing pattern of the AMOC cell. There is only a slight drift back toward the pattern of the uninitialised mean AMOC cell within the 20 years of hindcast simulation: an upward shift of the boundary between upper and lower circulation cell over lead time. The overall strength of the overturning cell does change over lead time (compare Fig. 7c, f). There is a weakening trend in most parts of the AMOC cell. South of the equator, this trend drives the AMOC strength very slowly back toward that of the uninitialised simulation. North of the equator, the weakening becomes apparent as well, e.g. for the AMOC at 26\(^\circ\)N (Fig. 8). Around 10 years after initialisation, and uniformly over the whole hindcast period, the AMOC at 26\(^\circ\)N has shifted to a mean state considerably different from the earlier lead years, accompanied by a sharp drop in inter-annual variability. For lead years 10 and larger, the AMOC mean strength (14 Sv) and inter-annual variability are lower than for earlier lead years (15–17 Sv mean strength). However, in terms of AMOC strength and variability at 26\(^\circ\)N, there is no apparent drift toward the uninitialised simulation within the 20-year hindcasts over the whole hindcast period.

Fig. 7

Time mean AMOC cell for the uninitialised simulation 1980–2019 (a), and 2081–2100 (b), the assimilation 1980–2019 (c), the hindcasts 1980–2019, averaged over lead years 4–6 (d), over lead years 9–11 (e), and over lead years 18–20 (f)

With the temperature and salinity assimilation used in our decadal prediction system, we indirectly initialise the AMOC. Because of the long-term memory of the AMOC cell and the related water and heat transports, this has consequences for the temperature evolution over the whole of the Atlantic Ocean. Effects of initialisation are easily detectable 10 to 20 years later. Depending on the focus and quality of the initialisation, surface temperature predictability might benefit (e.g. in the North Atlantic sub-polar gyre) or suffer (e.g. South Atlantic) from initialisation, see Fig. 3d.

Fig. 8

Time series of AMOC at 26\(^\circ\)N for the uninitialised simulation (blue), the assimilation (black), the hindcast lead years 5 (yellow), 10 (orange), 20 (red), and for the observations from RAPID (Moat et al. 2020, dashed pink). A running mean of 3 years has been applied

4 Discussion

Our results show that on timescales of up to twenty years, initialisation effects in terms of prediction skill of 2-m air temperature decrease rapidly over lead time for most regions. Most of the significant prediction skill seems to be contained in uninitialised simulations as well and is thus driven by the correct knowledge of the natural and anthropogenic forcing (Borchert et al. 2021; Menary et al. 2021). Nevertheless, our results also demonstrate that some regions do show an impact of initialisation for lead times up to 20 years. In addition, in our hindcasts the AMOC clearly carries the initialisation impact even beyond a 20-year lead time. We identify a small drift in the hindcasts towards the uninitialised simulation, but this process is not complete by lead year 20. Pushing the model towards the observed climate state during initialisation leads in our experiment to a long-term decline of the AMOC. This potentially changes the surface temperature not only in the North, but also in the South Atlantic on time scales up to and longer than 20 years. We have identified such an effect in the difference of prediction skill between hindcasts and uninitialised simulations over this time frame. In a similar way, initialising the model closer to the observed state of the Pacific Decadal Oscillation may lead to significant prediction skill in the Northeast Pacific for more than twenty years, where the historical simulation shows none.

These results show that effects of initialisation may not be lost after 20 years and probably persist for longer in the model. They also highlight that knowing the correct forcing is important. The model is well able to represent the effects of the forcing, leading to acceptable results either with or without an initialisation. Nevertheless, the resulting dependence on knowing the correct external forcing leads to questions about how we create hindcasts and how we evaluate their success. The longer the lead time from the initialisation, the more uncertainties are associated with the forcing (Meehl et al. 2014; Matthes et al. 2017; Gidden et al. 2019). Thus the representativeness of the comparison to the historical simulation has to be questioned. For a better verification of hindcast skill and of the validity of statements based on future predictions, a different approach to hindcast simulations would be required. In this approach the forcing would have to be assumed at each initialisation point for the upcoming decades, so that only information available before the point of initialisation is used. This approach would be more akin to the decadal prediction experiments described in Smith et al. (2010), Robson et al. (2012) and Timmreck et al. (2015) (albeit with CMIP5 forcing) and to the “Tier 4” experiments of the Decadal Climate Prediction Project (DCPP) outlined in Boer et al. (2016) (with CMIP6 forcing, but aiming only at five lead years). We expect that hindcast skill would drop considerably in such experiments (e.g. due to missing volcanoes; Timmreck et al. 2015; Wu et al. 2023), but those experiments would demonstrate the real benefit of initialisation. This benefit cannot be seen when the simulations include the prescribed forcing and are compared to an uninitialised model with the same forcing. In the current experiments we cannot identify whether the skill of a prediction is generated by the forcing or by the correct initialisation of physical processes. Consequently, in our results we see that in both cases the skill is high. Comparing initialised to uninitialised predictions in the current format does not offer the necessary insights into the ability of the initialised prediction in a real forecast scenario, especially on longer lead year horizons.

A common assumption in decadal predictions (e.g. Branstator and Teng 2012; Bilbao et al. 2021; Volpi et al. 2021) is that the effect of initialisation will diminish over time and the model regresses back to something comparable to the uninitialised model state. Due to model drift, this state might not be exactly consistent with uninitialised simulations with the model, but it is often assumed to be similar. The above results suggest that this convergence does not necessarily happen. Changes to the ocean circulation during initialisation, however carefully they might be done, may impact the model beyond the decadal time scale. This impact is not necessarily consistent with the assumption of a correctable model drift. We have found this especially in the case of the AMOC. The simulated AMOC is mainly a result of proper model tuning prior to the historical phase and of its transient reaction to the applied external forcing during the historical and scenario phases. We also demonstrated that performing assimilation experiments for our decadal prediction system can change the pattern of the AMOC considerably. We do not see that effect in historical and scenario simulations. Nevertheless, it remains open whether considerable changes to the external forcing, like volcanic eruptions, could also have the potential to affect the AMOC cell structure in the way the initialisation done here does. The AMOC decreases in most scenario simulations of various models for the upcoming decades (Weijer et al. 2020). Further investigation is required into whether the impacts on the AMOC we have identified in this study might help to better understand this phenomenon and the uncertainties associated with it.

5 Conclusion

This study has evaluated an initialised decadal prediction system with lead times of up to twenty years. It demonstrates that skill for surface temperature compared to an uninitialised simulation can take different trajectories for different areas of the world. While in the Northeast Pacific the additional skill does not change over time, in other areas like the South Atlantic a decrease in skill only materialises after more than ten lead years. We also showed that the evolution within our prediction system up to a lead time of ten years cannot necessarily be extrapolated to longer lead times. Therefore, the overall assumption that once the added value of initialisation is gone, the prediction system will drift towards the results from the uninitialised model, cannot be generalised. To improve the comparability of hindcasts with real forecasts, we suggest that new hindcast experiments be designed to separate the effects of the initialisation and external forcings.