Seasonal predictions initialised by assimilating sea surface temperature observations with the EnKF

This study demonstrates that assimilating SST with an advanced data assimilation method yields prediction skill level with the best state-of-the-art systems. We employ the Norwegian Climate Prediction Model (NorCPM)—a fully-coupled forecasting system—to assimilate SST observations with the ensemble Kalman filter. Predictions of NorCPM are compared to predictions from the North American Multimodel Ensemble (NMME) project. The global prediction skill of NorCPM at 6- and 12-month lead times is higher than the averaged skill of the NMME. A new metric is introduced for ranking model skill. According to the metric, NorCPM is one of the most skilful systems among the NMME in predicting SST in most regions. Confronting the skill to a large historical ensemble without assimilation, shows that the skill is largely derived from the initialisation rather than from the external forcing. NorCPM achieves good skill in predicting El Niño–Southern Oscillation (ENSO) up to 12 months ahead and achieves skill over land via teleconnections. However, NorCPM has a more pronounced reduction in skill in May than the NMME systems. An analysis of ENSO dynamics indicates that the skill reduction is mainly caused by model deficiencies in representing the thermocline feedback in February and March. We also show that NorCPM has skill in predicting sea ice extent at the Arctic entrance adjacent to the north Atlantic; this skill is highly related to the initialisation of upper ocean heat content.


Introduction
In recent decades, it has been proven that tailored seasonal climate forecasts are more beneficial than climatology for decision-making in many sectors of society, e.g. energy, agriculture, transport, insurance and water resource management (Soares and Dessai 2015). For example, Torralba et al. (2017) recently demonstrated that skilful seasonal predictions of near-surface air temperature and wind speed in winter improve the predictability of wind energy demand and supply; Gunda et al. (2017) demonstrated that using seasonal forecasts increases the agricultural income. To fulfil societal needs of seasonal climate forecasts, many meteorology centres (e.g., the European Centre for Medium-Range Weather Forecasts, ECMWF) and climate prediction centres (e.g., the Climate Prediction Center of the National Centers for Environmental Prediction, NCEP) have been providing seasonal forecasts operationally for many years; and many others are developing operational systems.
Seasonal climate predictions ) are performed by statistical models (van den Dool 2006), 1 3 dynamical models (Ji et al. 1996) or a combination of both (Krishnamurti et al. 1999). Dynamical models, such as ocean-atmosphere coupled general circulation models (CGCMs), are more suitable than statistical models to deal with unprecedented climate signals and chaotic behaviour in the climate system (Barnston et al. 1999). Most climate forecasting systems (e.g. systems shown in Weisheimer et al. 2009;Kirtman et al. 2014) deliver ensemble forecasts, where the forecast and its uncertainty are provided by the statistics of the ensemble, including the mean, median, variance and range. The ensemble forecasts are mainly based on a single dynamical model and generated by perturbing initial conditions and forcing, and through stochastic perturbations of model physics. Multimodel or perturbing model physics approaches were proposed to enhance the prediction skill by reducing the errors caused by initialisation and model physics (Palmer et al. 2004). A successful example is the North American Multi-Model Ensemble (NMME, Kirtman et al. 2014) operational forecasting system that consists of more than 13 CGCMs from US and Canadian modelling centres with ensemble members in individual CGCMs ranging from 6 to 24. A combination of statistical and dynamical models is an alternative for the seasonal forecasts tailored to certain specific needs. For instance, Gleixner et al. (2017) showed that a statistical-dynamical model beats a multimodel ensemble of 11 CGCMs for the seasonal prediction of the Kiremt rainfall. Analogs based on dynamical models has also introduced competitive means to perform seasonal predictions (Ding et al. 2018).
Initial conditions fed to CGCMs largely dominate dynamical climate predictions at short time scales (Kirtman et al. 2013;Doblas-Reyes et al. 2013). Balmaseda and Anderson (2009) evaluated three commonly used initialisation strategies (including atmosphere or/and ocean initialisations) in the ECMWF System3 for seasonal climate predictions. They demonstrated that ocean initialisation has a significant impact on seasonal predictions of CGCMs and recommended to initialise CGCMs with both ocean and atmosphere data. Most current ocean initialisations used for seasonal forecasting systems ) assimilate sea surface temperature (SST) observations, since SST plays a key role in influencing atmospheric circulations (Shukla 1998). They also assimilate many other oceanic observation types, e.g. subsurface temperature and salinity and altimeter data that have been demonstrated to improve seasonal predictions (Balmaseda and Anderson 2009).
As reviewed in  and Penny et al. (2017), most operational centres are evolving towards the use of sophisticated initialisation schemes in which observations in different components are best used with advanced data assimilation (DA) methods such as the ensemble Kalman filter (EnKF, Evensen 2003), the 4-dimensional variational DA (4DVAR, Dimet and Olivier 1986) and hybrid methods (Hamill and Snyder 2000;Carrassi et al. 2018). However, some previous studies (e.g., Luo et al. 2005;Zhu et al. 2017) demonstrated that a simpler initialisation scheme (i.e., nudging SST towards observations) already achieves skilful seasonal predictions. The mechanism for this scheme is that nudging SST towards observations reproduces well the variability of temperature in the mixed layer and the variability of subsurface temperature though air-sea interaction (Luo et al. 2005;Keenlyside et al. 2005;Kumar et al. 2014;Kumar and Zhu 2018). The method is most effective in the tropical Pacific where air-sea interaction is strongest, and can be problematic in high-latitudes and lead to spurious impacts on ocean thermohaline circulation in some models (Dunstone and Smith 2010). Nevertheless, the SST nudging scheme can reduce errors caused by inconsistencies between separately initialised components of CGCMs (i.e., initialisation shock). Luo et al. (2005) used the SST nudging initialisation scheme in the Scale Interaction Experiment-Frontier Research Center for Global Change (SINTEX-F) and achieved skilful predictions of the El Niño-Southern Oscillation (ENSO) up to 12 months. Zhu et al. (2017) even showed that such simple initialisation scheme could achieve a comparable seasonal SST prediction skill to the averaged skill of the NMME models that use more data (e.g., oceanic subsurface data and atmospheric data) and/or more advanced DA methods for initialisation.
The main aim of this study is to assess whether assimilating SST observations with an advanced DA method can be as or even more competitive than current state of the systems that assimilate more data. To do so, we will work with the Norwegian Climate Prediction Model (NorCPM, Counillon et al. 2014) which aims to provide long-term reanalyses and seasonal-to-decadal climate predictions. NorCPM uses the EnKF, which is an advanced and flowdependent DA method. Counillon et al. (2016) showed that upper ocean heat content in the equatorial and north Pacific, the north Atlantic subpolar gyre region and the Nordic Seas can be well constrained by assimilating SST anomalies (SSTAs) with the EnKF. Since seasonal predictions largely depend on upper ocean heat content (Balmaseda and Anderson 2009;), we expect that assimilation of SSTA with the EnKF achieves skilful seasonal predictions. In addition, this paper is the first demonstration of the seasonal prediction skill of Nor-CPM in a real-experiment framework. Note that although Counillon et al. (2014) studied the seasonal prediction of NorCPM, their study was carried out in a twin-experiment (also known as a perfect model) framework in which the 'truth' was an independently run simulations of the same model. Pseudo observations were generated from the 'truth' and used to initialise the model and assess the (perfect model) prediction experiments. This paper is organised as follows. The model, experimental design and data used for validation are described in Sect. 2. The global prediction skill of NorCPM is presented and compared to the NMME systems in Sect. 3. The ENSO prediction skill is investigated in Sect. 4. The prediction skill of regional Arctic sea ice extent (SIE) is assessed in Sect. 5.

Norwegian climate prediction model
NorCPM (Counillon et al. 2014) is a climate prediction system developed for seasonal-to-decadal climate predictions and long-term reanalyses. It combines the Norwegian Earth system model (NorESM, Bentsen et al. 2013) and the EnKF (Evensen 2003). NorCPM is unique in the sense that it uses a global isopycnal ocean model (i.e., MICOM, Bentsen et al. 2013) and an advanced error flow-dependent DA method (i.e., the EnKF).
NorESM (Bentsen et al. 2013) is a global fully-coupled model for climate simulations. It is based on the Community Earth System Model version 1.0.3 (CESM1, Vertenstein et al. 2012), a successor to the Community Climate System Model version 4 (CCSM4, Gent et al. 2011). In NorESM, the ocean component is an updated version (Bentsen et al. 2013) of the isopycnal coordinate ocean model MICOM (Bleck et al. 1992); the sea ice component is the Los Alamos sea ice model (CICE4, Gent et al. 2011;Holland et al. 2012); the atmosphere component is a version of the Community Atmosphere Model (CAM4-Oslo, Kirkevåg et al. 2013); the land component is the Community Land Model (CLM4, Oleson et al. 2010;Lawrence et al. 2011); the version 7 coupler (CPL7, Craig et al. 2012) is used.
In this study, we employ the version of NorESM that is included in the Coupled Model Intercomparison Project Phase 5 (CMIP5, Taylor et al. 2012). MICOM has a horizontal resolution of approximately 1 • , 51 isopycnal layers and 2 additional layers for representing the bulk mixed layer; CICE4 is on the same grid as MICOM; CAM4 has a horizontal resolution of 1.9 • at latitude and 2.5 • at longitude and 26 vertical levels in a hybrid sigma-pressure coordinate; CLM4 shares the same horizontal grid as CAM4. We briefly introduce the model performance of NorESM with respect to the ENSO (Sect. 4) and the Arctic sea ice (Sect. 5) in the following paragraphs.
The version of NorESM used in this paper features positive SST biases in the eastern boundary upwelling regions in the tropics that are common in CMIP5 models (Richter 2015) and positive precipitation biases related to the double Intertropical Convergence Zone (ITCZ, Bentsen et al. 2013).
The meridional width of the tropical Pacific cold tongue is also too narrow (Bentsen et al. 2013). Despite these common shortcomings, NorESM is evaluated as one of better models for simulating ENSO (Bellenger et al. 2014). NorESM reproduces realistic SST variability in the Nino3 region (5 • S-5 • N, 150 • W-90 • W) and the Nino4 region (5 • S-5 • N, 160 • E-150 • W). The power spectrum, seasonality and spatial structure of ENSO in NorESM are also comparable to observations (Bellenger et al. 2014). In addition, NorESM simulates well ENSO teleconnection patterns, such as interannual variations of the relationship between ENSO and the Asian summer monsoon (Sperber et al. 2013).
The version of NorESM used in this paper reproduces a fairly realistic geographic distribution of the Arctic sea ice. However, the summer Arctic SIE is too large, since the summer melting is too slow. It is likely linked to the too thick sea ice in NorESM, in particular in the polar oceans adjacent to the Eurasian continent. The general thick Arctic sea ice in NorESM is due to too little summer melt of snow (i.e., too little surface melt of the ice). Moreover, the winter melting in NorESM lags the observed melting. For further details see Bentsen et al. (2013).

Experimental design
We base the prediction skill assessment of the system on hindcasts (i.e., retrospective predictions), similar to many previous studies (e.g., Luo et al. 2015;Zhu et al. 2017). Seasonal hindcasts start on the 15th of January, April, July and October each year during 1985-2010. Totally, there are 104 hindcasts (26 years and 4 season starts per year). Each hindcast consists of 9 realisations (ensemble members) and is 13 months long. The hindcasts are forced by CMIP5 historical forcings (Taylor et al. 2012) Wang et al. 2005), volcanic sulphate aerosol concentration (Ammann et al. 2003), GHG concentration , aerosol emission , and land-use (Hurtt et al. 2009). Initial conditions are taken from the first 9 out of the 30 ensemble members 1 of a reanalysis covering the period of 1980-2010. In this reanalysis we use the same model and DA settings to Counillon et al. (2016). We assimilate monthly SSTA data from Hadley Centre Sea Ice and Sea Surface Temperature dataset version 2.1 (HadISST2.1.0.0, Kennedy et al., personal communication; Rayner et al., personal communication) with the EnKF into the ocean component of NorCPM at each assimilation step; the other components are dynamically adjusted during the system integration after the assimilation step. Note that we perform an anomaly assimilation with climatology defined for the period 1980-2010. The initial conditions of the reanalysis are taken from a 30-member historical simulation ensemble of NorESM that was integrated from 1850 to 2010 using CMIP5 historical forcings (Taylor et al. 2012). For further details see Counillon et al. (2016).

Data
In order to evaluate the prediction skill of NorCPM in a large framework, we compare the NorCPM hindcasts to the NMME hindcasts (Kirtman et al. 2014). The NMME is a multi-model seasonal forecasting system, that consists of several coupled climate models with different setups from US and Canadian modelling centres. For further details see Kirtman et al. (2014) or https ://www.earth syste mgrid .org/ searc h.html?Proje ct=NMME. In this study, we select 13 NMME systems which provide SST hindcasts from 1985 to 2010 (Table 1). All NMME hindcasts start on the first day of each month and have lead times up to 8-12 months. Note that the NorCPM hindcasts start 15 days earlier than the NMME hindcasts (e.g., our hindcasts starting on the 15th of April are compared to the hindcasts of NMME starting on the 1st of May). The ensemble size ranges from 6 to 24 among the NMME models. Here we use the first 9 ensemble members of each NMME model (except for CCMS3 that only provides 6 ensemble members) to have a comparable ensemble size to NorCPM. The NMME hindcast data are provided as monthly means with a horizontal resolution of For the validation of the hindcasts, we take SST data from the National Oceanic and Atmospheric Administration (NOAA) Optimum Interpolation SST (OISST) version 2 with a 1 • resolution (Reynolds et al. 2002) which is not the SST dataset used to initialise the hindcasts of NorCPM. Note that the depth of SST is different in OISST, NorCPM and NMME systems. SST data in OISST determined from in situ observations (e.g., ships and buoys) and satellite data (e.g., Advanced Very High Resolution Radiometer) are the temperature at depths from a micro to several meters (Reynolds et al. 2002). The depth of SST in a model depends on vertical resolution (Table 1). For example, SST in the ocean component of CanCM3 and CanCM4 is the temperature at the first vertical level from the ocean surface to 10 m (Merryfield et al. 2013). SST in NorCPM is the temperature in the top layer (from 1 to 10 m) of the 2 layers for representing the bulk mixed layer and its depth varies in time (Bentsen et al. 2013). Although the difference of the depth of SST in observations and models may influence results, we do not take it into account in this paper. Monthly sea surface height (SSH) anomaly data (sea level anomalies) are taken from the Global ARMOR3D L4 Reprocessed dataset (http://marin e.coper nicus .eu, available from 1993 to present) produced by Ssalto/Duacs and distributed by Aviso with support from the Centre national d'études spatiales (www.aviso.oceanobs. com/duacs/). The product is gridded to a resolution of 1 • in order to have a comparable resolution to NorCPM. Precipitation observations are the combined monthly precipitation dataset of the Global Precipitation Climatology Project (GPCP) version 2.2 (https ://clima tedat aguid e.ucar. edu/clima te-data/gpcp-month ly-globa l-preci pitat ion-clima tolog y-proje ct). They are available on a 2.5 • × 2.5 • grid from 1979 to present. Other atmospheric data are derived from the NCEP reanalysis data (Kalnay et al. 1996) provided by the NOAA Physical Sciences Division, Boulder, Colorado, USA, from their Web site at https ://www.esrl.noaa.gov/psd/. Oceanic subsurface data is taken from the EN4 objective analysis (EN4.1.1, Good et al. 2013). Sea ice concentration data is taken from HadISST2.1.0.0 (Torralba et al. 2017). Note that SST data in the regions covered by sea ice are not assimilated; the regions are defined by the sea ice data provided by HadISST2.1.0.0. In the following sections, the anomaly calculations are based on the reference period 1985-2010.

Global prediction
In this section, we assess the skill of NorCPM in predicting global SST, air temperature at 2 m (T2m) and precipitation and compare its performance to that of the NMME (Sect. 2.3). Skill is measured by anomaly correlation coefficient (ACC) and the bias-free root mean square error (RMSE; i.e., computed with anomalies) between hindcasts and observations. Figures 1 and 2 show the global ACC and RMSE maps of T2m at 6-and 12-month lead times and precipitation at 3-month lead time. NorCPM and NMME have skill in predicting T2m at 6-and 12-month lead times with higher ACC and lower RMSE over the oceans and in the tropics. The skill of NorCPM is higher than the skill averaged over the NMME systems. In terms of precipitation, both NorCPM and NMME have skill at 3-month lead time over the tropical Pacific. Table 2 shows the global average of ACCs and RMSEs for T2m at 6-and 12-month lead times and precipitation at 3-month lead time. Note that we compute ACC for the individual NMME systems and the average of the ACCs of the NMME systems is shown in Table 2. For skill at 12-month lead time, we only use the 8 NMME systems providing data. NorCPM has higher ACCs and lower RMSEs than the average of ACCs and RMSEs of the NMME for T2m and precipitation on the global scale.
In the following, we focus on assessing the skill in predicting SST. Since SST is very sensitive to the air-sea Table 1 Brief description of NorCPM and 13 NMME models used in this paper  interaction, it is suitable for monitoring the surface ocean state changes (Shukla 1998;Deser et al. 2010). Furthermore, SST has been very well observed by satellites since 1982 and SST datasets are commonly used to assess forecasts. We also expect the best predictive skill for SST, as this quantity is used to initialise our hindcasts (Sect. 2.2). Figure 3 shows ACCs between SSTAs from OISST and NorCPM and the averaged ACCs between SSTAs from OISST and the individual NMME systems at 6-and 12-month lead times. Similar to the NMME systems, Nor-CPM has high ACCs in the tropical Pacific. Furthermore, NorCPM is skilful (ACCs > 0.6) in the tropical western Atlantic and the Iceland Basin. Overall, the ACCs of Nor-CPM are higher than the ACCs averaged over the NMME systems in most regions.
It is of great interest to disentangle the part of prediction skill related to natural internal variability from the part driven by the external forcing (Sect. 2.2). To do so, we split the ACC of NorCPM into two parts as follows: where var(⋅) is the variance over time, is the observed SST, 1 is SST taken from the hindcasts initialised by the NorCPM reanalysis (Sect. 2.2), 2 is SST taken from the uninitialised hindcasts and = 1 − 2 (i.e., the part added by DA). The uninitialised hindcasts are taken from the 30-member free running NorCPM model that is a continuous run from 1980 to 2010 without the DA (named the NorCPM free run). This is equivalent to running hindcasts for each start date from the NorCPM free run. Note that 1 is based on the ensemble of 9 members while 2 is based on the ensemble of 30 members. A large ensemble is required to average out the internal variability in each ensemble member of the free run to robustly reveal the externally forced variability in the ensemble mean (Solomon et al. 2011). While a large ensemble can be desirable to isolate the initialised component (Scaife and Smith 2018), we have restricted our initialised hindcast ensemble size to 9 because of computational limitations. Part 1 can be considered as the skill due to external forcing and Part 2 can be considered as the added skill due to DA. Note that it is not possible to fully distinguish between skill due to external forcing and DA (Solomon et al. 2011), since the DA also corrects the contribution of external forcing. Figure 4 presents the parts of the decomposition of the NorCPM ACCs at 6-month and 12-month lead times (top panels in Fig. 3). We find that the prediction skill of Nor-CPM in the tropical Pacific related to the high prediction skill of ENSO (Luo et al. 2015) is due to the DA. The prediction skill in the tropical western Atlantic and the Iceland (1) Basin is related to both external forcing (left panels in Fig. 4) and the DA (right panels in Fig. 4). Note that most NMME systems include external forcing as well (Table 1). However, NorCPM is based on a climate model, and as such is developed to well capture the climate response to external forcing (Bentsen et al. 2013). Figure 5 shows RMSEs of NorCPM SSTAs and the averaged RMSEs of SSTAs of the NMME systems at 6-and 12-month lead times. Like the NMME, NorCPM has higher RMSEs in the east-central equatorial Pacific, the North Pacific, the Gulf Stream and the Kuroshio Stream than in other regions, because of the larger (observed and predicted) SST variability in these regions. Nevertheless, the RMSEs in NorCPM are generally lower than the RMSEs averaged over the 13 NMME systems (Table 2) in particular in the Gulf and Kuroshio Streams.  The above results show that NorCPM has a skill of global SST predictions higher than the averaged skill of the NMME. In the following, we compare NorCPM to individual NMME systems. To do so, we define four bins that relate to the ACC or RMSE. For the ACC, the bins are distinguished by a significance test (Fisher z-transformation) at a 5% significance level. For the RMSE, we use a threshold of 0.1 • C. These four skill bins are described as follows: • SIMILAR all models have a similar prediction skill (no significant difference of ACCs or less than 0.1 • C difference of RMSEs); • BEST NorCPM leads to a prediction skill higher than or equal to the highest skill of the NMME systems; • INTERMEDIATE NorCPM leads to a prediction skill lower than the highest skill but higher than the lowest skill of the NMME systems; • WORST NorCPM leads to a prediction skill lower than or equal to the lowest skill of the NMME systems. Figure 6 shows the ranks of NorCPM among the 13 NMME systems with respect to ACC and RMSE. The proportions of the four skill bins are presented in Table 3. The results indicate that NorCPM belongs to the SIMILAR/ BEST skill bins in most oceanic regions. In the tropical western Atlantic and in the region that extends from the Iceland Basin to the Barents Sea, NorCPM belongs to the BEST bin for both ACC and RMSE up to 12-month lead time. This analysis reveals that NorCPM is among the better systems of the NMME for predicting global SST at 6-and 12-month lead times. This confirms that skilful seasonal predictions using CGCM can be achieved by initialisation approaches using only SST observations as shown in previous studies (Luo et al. 2005;Keenlyside et al. 2005;Zhu et al. 2017).
Our better skill can be because of better data, better initialisation scheme, or a better model. Use of better data is an unlikely reason, because we use similar SST products to the NMME systems that mostly use more ocean data (Table 1). Without a detailed comparison it is difficult to be conclusive regarding differences in model performance. However, we use a climate model not specifically developed for seasonal prediction, while the NMME systems have been mostly developed for this task. Hence, it is unlikely at least compared to the NMME models that our good performance is because of a superior model. Thus, it is most likely our initialisation scheme is the main reason for the on average better skill of our system compared to the NMME. The advanced DA method we used (i.e., the EnKF) is a multivariate and flow-dependent DA method (Evensen 2003). When assimilating SST observations, it can effectively propagate information to other ocean state variables than SST (e.g., mixed layer depth and upper ocean heat/salt contents, Counillon et al. 2016) making a better use of SST data to estimate initial conditions for hindcasts. Furthermore, most NMME systems perform their hindcasts in a full field initialisation framework from uncoupled GCMs and reanalyses produced by the other models (Table 1). This will yield initialisation shock and model drift that will degrade prediction skill. This is not a major issue in our hindcasts, as we use an anomaly initialisation and use the same coupled model to generate initial conditions and perform hindcasts. Fig. 4 Parts in the decomposition of the NorCPM ACCs at 6-month (top panels) and 12-month (bottom panels) lead times. Please refer to Eq. (1) Fig. 3, but for the RMSE ( • C) of SSTA It is worth noting that a low-resolution CFSv2 with SST nudging scheme (Zhu et al. 2017) achieved a skill comparable to the averaged skill of the NMME while the skill of NorCPM is better than the averaged skill of the NMME. The SST nudging and multivariate EnKF schemes work well in regions where two-way ocean atmosphere interaction is strong, such as the tropical Pacific. We expect our scheme can perform better than SST nudging for the following reasons: firstly, the multivariate EnKF scheme can work in regions where there is a strong anti-correlation between surface and subsurface variability and air-sea coupling is weak (e.g., the extra-tropics). Second, the SST nudging scheme is not necessarily able to maintain a consist relation between temperature and salinity. This can lead to spurious behaviour through impacts on buoyancy (Dunstone and Smith 2010). This can also be important in the tropical Pacific warm pool (Maes et al. 2005) and for ENSO prediction (Hackert et al. 2011). The EnKF captures these physical relations because it makes use of covariances (Counillon et al. 2016). Thus, in general we expect that a nudging scheme would not perform as well as our advanced assimilation scheme, but we have not performed any additional experiments to confirm this.

Fig. 5 As in
In the following sections, we will focus on the prediction skill of ENSO in the tropical Pacific and sea ice at the entrance of the Arctic adjacent to the north Atlantic. The tropical Pacific is selected because it is a region where prediction systems perform well and the Arctic where seasonal predictions are less developed. In the ENSO region, we will assess the reason why NorCPM is skilful for ENSO predictions. At the entrance of the Arctic where NorCPM seems to be one of the better models, we will assess the reasons for the skill of NorCPM in predicting sea ice.

ENSO prediction
Nowadays, climate prediction systems can typically provide skilful ENSO prediction 6-9 months ahead (e.g., Zheng et al. 2006;Zhu et al. 2017). One notably exception is the SINTEX-F model, which can successfully predict ENSO up to 2 years ahead (Luo et al. 2015). We consider the Nino3.4 SSTA index (SSTA in the region, 5 • S-5 • N and 120 • W-170 • W) for ENSO predictions. The green lines in Fig. 7 show the time series of Nino3.4 SSTA predicted 6 and 12 months ahead by NorCPM. As reported in Jin et al. (2008), the stronger ENSO events are more predictable than weaker or neutral ones. El Niño and La Niña events are well predicted 6 months ahead by NorCPM in terms of their timings and amplitudes. The ensemble spread of Fig. 6 Ranks of NorCPM among the NMME systems for the ACC and RMSE at 6-month or 12-month lead time. Please refer to Sect. 3 for the definition of the four skill bins: SIMILAR, BEST, INTERMEDIATE and WORST Table 3 The proportions of the four bins related to ACC and RMSE at 6-month and 12-month lead times in Fig. 6 Metric Lead time SIMILAR the ENSO predictions are much larger at 12 months lead time, since forecast uncertainties grow due to the chaotic behaviour of the system. It is found that the ENSO events predicted 12 months ahead have weaker amplitudes than the observations and the phases of some weak ENSO events are shifted. Nevertheless, the observations generally fall within the range of the NorCPM predictions (green shading in Fig. 7) for most ENSO events, indicating that our system is quite reliable. The strongest El Niño of 1997/98 is well predicted 12 months ahead albeit its amplitude is too weak. We assess the skill of the NorCPM reanalysis (Sect. 2.2)-from which our hindcasts are initialised-in representing the variability of the depth of the 20 • C isotherm (Z20), because realistic fluctuations of thermocline depth (equivalent to warm warm volume) are critical for ENSO prediction (Smith et al. 1995;Meinen and McPhaden 2000;McPhaden 2003;Zhu et al. 2015a). Figure 8 presents monthly variation of ACCs between Z20 anomalies (averaged over 5 • S-5 • N) derived from the EN4 dataset (EN4.1.1) and the NorCPM reanalysis over the period of 1985-2010 in the equatorial Pacific. It is found that assimilation of SSTA in NorCPM constrains well Z20 in the equatorial Pacific, apart from August and September over 180 • -160 • W and February to June over the 160 • -80 • W. The poor skill in the central equatorial Pacific in August and September is connected to an unrealistic thermocline feedback in our model. The thermocline feedback describes the influence of subsurface temperature variability on SST, mainly through the mean vertical advection of subsurface temperature anomalies (Jin and An 1999). We use the correlation between Z20 and SST anomalies to assess the strength of this feedback. Figure 9 shows the seasonality of SST-Z20 relationship in the observations, a free run of NorCPM (no DA, 30 members) and the reanalysis of NorCPM (30 members). Compared to the free run (top right panel in Fig. 9), the thermocline feedback over 180 • -160 • W is improved in the reanalysis but is still opposite to that in observations between August and September. The decrease in ACC in the boreal late winter and spring across the eastern equatorial Pacific (Fig. 8) is consistent with the findings of Zhu et al. (2015b). During that period, the thermocline feedback is very weak (Fig. 9). As a consequence of small covariance between SST and subsurface, DA does not effectively propagate the surface information into the ocean interior. Figures 10 and 11 presents the ACCs and RMSEs between observed and predicted Nino3.4 SSTAs as a function of hindcast lead time for all four season starts. The results indicate that the ACCs and RMSEs of NorCPM are within the range of that of the NMME models. This is consistent with Fig. 6. NorCPM is one of the models that perform best in terms of ACC and RMSE for July start and shows relatively lower RMSEs for all four season starts (Fig. 11). In addition, the ENSO prediction skill in all models is season-dependent (Luo et al. 2015;Zhu et al. 2015bZhu et al. , 2017. However, NorCPM has a more pronounced dip in ACC in May (i.e., a more pronounced spring predictability barrier) than the NMME systems (Fig. 10). NorCPM is even out of range of the NMME systems in May, but its ACC reemerges into the ACC range of the NMME systems beyond the boreal spring.
As the May dip of the ENSO predictability occurs for all season starts (Fig. 10), it is likely due to a deficiency in our model in the representing ENSO dynamics rather than in the initial conditions. However, it is unlikely that the Bjerknes feedback during the boreal spring is responsible for the May dip in SST prediction skill, as the relation is reasonably captured from February to May (Fig. 12). The thermocline feedback is a more likely factor. In the free run the thermocline feedback is of opposite sign in the western and eastern parts of the Nino3.4 region (170 • -120 • W), while in observations the feedback is generally positive but strongest in the east (Fig. 9). The discrepancies between observations and the model are largest from February to May, as the model overestimates both the negative correlation in the western part and the positive correlation in the eastern part of the Nino3.4 region. During the rest of the year, the thermocline feedback strengthens in the east and the negative correlations weaken in the west, and as a result the model and observations agree better (Fig. 8). The poorly simulated relation between Nino3.4 SST and Z20 anomalies during boreal spring can explain the discrepancy between predicted and observed SST anomalies during these months. As discussed above, it can also explain the poorer skill in constraining the subsurface temperature in the central and eastern Pacific from February to May in the reanalysis seen in Fig. 8. A further contributing factor is the sharp reduction in skill in predicting thermocline depth anomalies in February and March in the Nino3.4 region for all four season starts (Fig. 13), as it can take 1-2 months for the SST in this region to respond to the subsurface changes (Zelle et al. 2004). The reason for this reduction skill is not clear but it could be potentially related to the poorly simulated Bjerknes feedback during the second half of the year (Fig. 12), as the warm water volume will adjust with 6-9 months delay to the covarying off-equatorial wind stress (Jin 1997). In summary, the May dip in predicting SST in the Nino3.4 region is likely caused by the poorly predicted Z20 anomalies in early boreal spring together with an incorrect influence of these Z20 anomalies on SST.
The cause of the model errors leading to the May dip in skill are difficult to conclusively identify because of the complexity of the coupled system. However, we speculate that there is a close relation between these errors and those in the model climatology. During March to May our model shows an excessive rain band south of the equator that extends across the entire Pacific, while there is a dry bias in the equatorial western Pacific as common to many climate models ( Fig. i in supplementary material). As a result, the seasonal weakening of the equatorial trade winds is not properly simulated by our model and nor is the development of the seasonal warming of the equatorial eastern Pacific during boreal spring. The observed equatorial seasonal cycle is connected with the southern hemisphere seasonal cycle because the ITCZ remains north of the equator (Chang and Philander 1994;Xie 1994); this connection is weakened by the double ITCZ bias in boreal spring in our model. We can infer from the poor seasonal cycle in equatorial zonal winds and SST (Fig. ii in supplementary material) that the seasonal cycle of equatorial upwelling is also underestimated. (We don't plot the seasonal cycle of vertical velocity as it was not output by the model.) This can explain why the simulated between SST and Z20 anomalies remain strongly related across  (Fig. 9). We will further assess the impact of model climatology errors on seasonal prediction skill using an anomaly coupled version of the model (Toniazzo and Koseki 2018), but this work is beyond the scope of this current paper.
ENSO can influence other quantities than SST or regions beyond the tropical Pacific (ENSO teleconnections, Luo et al. 2015). Here, we focus on April hindcasts for the boreal winter (December, January and February; DJF) in which strong ENSO teleconnections have been reported (Wallace and Gutzler 1981). Figure 14 shows the ACCs between observations and NorCPM hindcasts for SSH, precipitation, SLP and geopotential height at 500 hPa (Z500) in the boreal winter. The SSH prediction is skilful in the equatorial Pacific and eastern Indian ocean and the tropical Atlantic. The high prediction skill of precipitation in the equatorial Pacific, Indonesia and Amazon is associated with the good skill in predicting El Niño events. For SLP and Z500, higher ACCs are mostly found in the tropics. Some regions with significant ACCs are found in the north Pacific in good agreement with the teleconnections to ENSO reported by Alexander et al. (2002).

Regional Arctic SIE prediction
The Arctic sea ice forecast is of great importance for local communities and stakeholders (McGoodwin 2007;Liu and Kronbak 2010). We expect that a system predicting SST well has some skill in predicting the Arctic sea ice, because SST and sea ice are highly anticorrelated. The top panels of Fig. 15 present the seasonality of determination coefficient ( R 2 ) between detrended SIE and heat content in the upper 300 m (HC300) from the reanalysis of NorCPM for The results indicate that the variability of SIE is highly related to HC300, in particular in the GIN Seas and in the boreal winter-spring in the Barents and Labrador Seas. This agrees with previous findings (Bitz et al. 2005;Onarheim et al. 2015;Årthun et al. 2017). As shown in Sect. 3, Nor-CPM is skilful in predicting the SST variability in the region that extends from the Iceland Basin to the Barents Sea. Thus we expect skill in predicting the sea ice variability in the Arctic regions adjacent to the north Atlantic. The bottom panels of Fig. 15 show the correlation coefficients between detrended SIEs from the hindcasts and Had-ISST2.1.0.0 for the GIN Seas, Barents Sea and Labrador Sea. We perform the Student's t-test at the significance level of 5% to test statistical significance of the correlation coefficient. The correlation coefficients that are not significantly different from zero are marked by black dots (Fig. 15). The correlation coefficients in the other regions of the Arctic basin are not presented in this paper, since it is mostly not significant. Consistent with the findings of Koenigk and Mikolajewicz (2008), Day et al. (2014) and Bushuk et al. (2017), the highest SIE predictability is found in the basins adjacent to the north Atlantic. Overall, NorCPM provides skilful SIE predictions in the boreal winter and spring while the prediction skill is not significant in the rest of the year (bottom panels of Fig. 15). During the boreal winter and spring in particular in the Barents Sea, warmer ocean condition yields larger heat loss to the atmosphere that reduces sea ice freezing and smaller sea ice cover (Årthun et al. 2012). In the rest of the year, both the initial state of sea ice and the atmospheric variability have large influence on the SIE variability (Bushuk et al. 2017). This is confirmed by the seasonality of the relation between SIE and HC300 (top panels in Fig. 15) as well.
The Arctic SIE prediction skill of NorCPM is regiondependent (Fig. 15). In the GIN Seas, NorCPM is skilful in the first three lead months from January and April. SIE predictions starting in July and October are not skilful, since SIE also depends on the volume of outflow of sea ice via the Fram Straits (Bitz et al. 2005). In the Barents Sea, Nor-CPM is skilful in the target months from January to June at lead times of from 1 to 8 months. This reflects that the heat transported by the Gulf Stream and North Atlantic Current, and through the Norwegian Seas into the Barents Sea plays a relevant role in determining the sea ice during the melting season in this region. In the Labrador Sea, the predictions initialised in January are skilful until June, those initialised in April are only skilful in the first 2 months and those initialised in October are skilful in the first three lead months.

Conclusions
This study clearly demonstrates that assimilating SST observations with an advanced DA method can be as or even more competitive than the current state of the systems that assimilate more data. We demonstrate this using the NorCPM, and The global SST prediction skill (ACC and RMSE) of NorCPM at 6-and 12-month lead times is generally higher than the averaged skill of 13 NMME systems, especially in the tropical western Atlantic and the region extending from the Iceland Basin to the Barents Sea. The skill in these regions is due to the combined effect of external forcing and DA. Like the NMME systems, NorCPM has the highest prediction skill in the ENSO region, which is found mainly due to DA. In addition, NorCPM is found to rank among the better models of the NMME in most regions.
NorCPM can predict well the ENSO events 6 months ahead. Although the ENSO events predicted 12 months ahead are less accurate with respect to their amplitudes and timings, the observations fall mostly within the prediction uncertainties of NorCPM. The high ENSO predictability is due to the fact that NorCPM skilfully constrains warm water volume by assimilation of SSTA. The ENSO prediction skill of NorCPM is season-dependent and NorCPM is one of the better systems for July starts. However, NorCPM has a spring predictability barrier more pronounced than that of the NMME systems. There is a pronounced skill drop in May independent of season starts but the model performance recovers in June. An analysis demonstrates that the skill drop in May is likely linked to a weak and inconsistent thermocline feedback in NorCPM from February and March. Despite this limitation, NorCPM reproduces reasonably well ENSO teleconnection patterns in the boreal winter and provide skilful predictions beyond the tropical Pacific.
NorCPM shows some skill in predicting SIE in the GIN Seas, the Barents Sea and the Labrador Sea from January and April. In these regions, the SIE predictability in the boreal winter and spring is highly related to the initialisation of upper ocean heat content, which is consistent with the findings of Bushuk et al. (2017). In addition, the SIE variability is most predictable in the Barents Sea where NorCPM is skilful in the months between January and June and with up to 8 months lead time.
In the future, we are going to include oceanic subsurface data (such as Argo float data) into the initialisation scheme of NorCPM for seasonal-to-decadal predictions, since such data are crucial to constrain the ocean vertical structure ). It will be interesting to demonstrate whether and at which time scales using more data in Nor-CPM leads to higher prediction skill. On the other hand, Bushuk et al. (2017) Fig. 15 Top panels present the seasonality of determination coefficient ( R 2 ) between detrended SIE and HC300 from the reanalysis of NorCPM for the GIN Seas, the Barents Sea and the Labrador Sea (black areas). Bottom panels show the correlation coefficients between detrended SIEs from the hindcasts of NorCPM and observations in these regions. Dots indicate the correlation coefficients that are not statistically significant at the 95% confidence level for assimilating sea ice concentration observations into Nor-CPM and showed a great potential of strongly ocean-sea ice coupled DA. We will, in addition to SST, assimilate sea ice observations into NorCPM, with the expectation of improving the seasonal prediction skill of Arctic sea ice.