1 Introduction

Earth system models are key to the IPCC assessment process but differ in resolution, subgrid-scale physics, and biogeochemical components. These differences affect their simulated modern climate and future projections. Thus it is essential to assess model skill, i.e., agreement between simulated and observed fields (Meehl et al. 2007; Randall et al. 2007).

Systematic assessment of numerical ocean models with data-based (e.g., Saba et al. 2010) and satellite-derived products (e.g., SeaWiFS, Carr et al. 2006) reveals large differences between ocean carbon models in terms of their ability to match relevant data fields (e.g., Sarmiento et al. 2000; Doney 2004; Schneider et al. 2008; Steinacher et al. 2010). Yet greater model complexity does not necessarily imply greater model skill (e.g., Friedrichs et al. 2007, 2009; Kriest et al. 2010). Unfortunately, when comparing ocean carbon models it is difficult to disentangle the main causes for disagreement because of their many differences, including resolution, parametrizations for subgrid-scale physics and biogeochemical components. This problem is exacerbated in Earth system models, where there are more model components, which differ as well.

To shed light on how simulated ocean circulation affects simulated ocean biogeochemistry, the Ocean Carbon-Cycle Model Intercomparison Project (OCMIP) compared simulations from different circulation models but with the same embedded diagnostic biogeochemical model (e.g., Najjar et al. 2007). This study illustrated how differences in surface forcing, subgrid-scale parametrizations, and resolution can drive discrepancies in simulated biogeochemical fields, rivaling those due to unrepresented biogeochemical processes. Since the OCMIP effort, marine biogeochemistry models have evolved (e.g. Hood et al. 2006) from models with one nutrient, one phytoplankton, one zooplankton, and one detritus (e.g., NPZD, Fasham et al. 1993) to multiple phytoplankton functional types and several limiting micronutrients (e.g., Le Quéré et al. 2005; Aumont and Bopp 2006). However, the OCMIP approach of comparing models with different circulation fields and the same biogeochemical model remains useful. Here, we use the same approach but in the context of the new generation of Earth system models that are being used in the ongoing IPCC assessment process in the framework of the Coupled Model Intercomparison Project (CMIP5, Taylor et al. 2009, 2011).

Although the full range of Earth system models taking part in CMIP5 use different ocean biogeochemical components, we take the OCMIP approach with three of them that use a common ocean biogeochemical model, PISCES (Aumont and Bopp 2006). In comparison to the diagnostic ocean biogeochemical model used in OCMIP, the nutrient cycling of PISCES is prognostic (i.e., it is not restored to observed nutrient concentrations). This biogeochemical model further includes a representation of the iron cycle and limitation by multiple nutrients (Aumont 2003; Schneider et al. 2008). Additionally, in the three selected Earth System Models, each of the ocean circulation models is based on the same code, although versions differ as to resolution and subgrid-scale physics (Madec 2008a, b; Marti et al. 2010; Voldoire et al. 2012; Dufresne et al. 2012). The largest differences in simulated ocean circulation fields probably stem from each ocean model being coupled to a different atmospheric model. The three atmospheric models differ in horizontal and vertical resolution, physics parametrizations, and architecture.

This effort has two objectives: (1) to evaluate models in terms of how well they simulate present-day ocean biogeochemical fields, and (2) to shed light on the relative roles of physical and biogeochemical controls on marine biogeochemistry. Within this framework, we examine differences between modeled and observed physical fields (sea surface temperature, sea surface salinity, wind stress, shortwave radiation, net air-sea exchange of heat, mixed-layer depth and water mass transport) and marine biogeochemical fields (air-sea exchange of CO2 and O2, export of organic matter, and the 3-D distribution of biogeochemical tracers). These physical fields are known to have first-order effects on marine biogeochemistry (Sarmiento and Gruber 2006); these fields and their effects are sensitive to resolution and to the parametrization of subgrid-scale physics. Model-data and model-model differences among the three models offer hints as to how to improve simulated marine biogeochemistry in the next generation of Earth system models.

2 Methods

2.1 Climate models

This study exploits three fully coupled 3-D atmosphere-ocean models, each using the OASIS coupler (Valcke 2006). OASIS handles the space and time interpolation between model grids and time-steps. IPSL-CM4-LOOP was used during the IPCC Fourth Assessment Report (AR4; Meehl et al. 2007). The other two models, IPSL-CM5A-LR and CNRM-CM5.1, are the latest generation of climate models that will contribute to the IPCC Fifth Assessment Report (AR5). The main model characteristics are listed in Table 1 and detailed below.

Table 1 Name, resolution and reference for the atmospheric, oceanic, sea-ice and land components of the three Earth system models

2.1.1 IPSL-CM4-LOOP

The IPSL-CM4-LOOP (IPSLCM4) model combines four major dynamical components of the climate system: atmosphere, land, ocean, and sea-ice. The atmospheric component is LMDZ-4 developed at LMD (Laboratoire de Météorologie Dynamique) with a horizontal resolution of 3.75° × 2.5° and 19 vertical levels (Hourdin et al. 2006). The land component is the global land-vegetation model ORCHIDEE (Organizing Carbon and Hydrology in Dynamic EcosystEms, Krinner et al. 2005), which simulates interactions between land and atmosphere. The ocean model, OPA8, has a horizontal resolution of 2°–0.5° at the Equator and 31 vertical levels (Madec et al. 1998). Vertical eddy and viscosity coefficients are computed prognostically using a 1.5-order turbulent closure scheme (Blanke and Delecluse 1993). The sea-ice component is LIM2 (Louvain Ice Model, Fichefet and Morales-Maqueda 1997), which resolves three layers (one for snow and two for sea-ice) and accounts for sea-ice dynamics. The model is described in more detail in Marti et al. (2010), but the version used here also includes the marine carbon cycle model PISCES (detailed below). IPSLCM4 is a fully coupled climate and carbon cycle model (Cadule et al. 2010); thus carbon sources and sinks over land and ocean are included.

2.1.2 IPSL-CM5A-LR

The current version of the coupled model developed at IPSL is referred to as IPSL-CM5A-LR (IPSLCM5, Dufresne et al. 2012). It is the most recent development of the previous version, IPSLCM4. As before, IPSLCM5 combines the four major components of the climate system, but it now also includes a module for tropospheric chemistry, aerosols, and online interactions with land vegetation. The atmosphere and land models of IPSLCM5 are updated versions of those used in IPSLCM4, namely, the atmospheric general circulation model LMDZ (Hourdin et al. 2006) and the ORCHIDEE land-surface model (Krinner et al. 2005). The atmospheric and land components use the same regular horizontal grid with 96 × 96 points, representing a resolution of 3.6° × 1.8°, while the atmosphere has 39 vertical levels. The INCA (INteraction between Chemistry and Aerosol, e.g., Schulz 2007 and Szopa et al. 2012) model is used to simulate tropospheric greenhouse gases and aerosol concentrations, while stratospheric ozone is modeled by REPROBUS (Reactive Processes Ruling the Ozone Budget in the Stratosphere, Lefevre et al. 1994, 1998).

The oceanic component is based on the ocean part of the “Nucleus for European Modelling of the Ocean” (NEMOv3.2, Madec 2008a, b). It has a horizontal resolution of 2°–0.5° at the Equator and 31 vertical levels. NEMOv3.2 also includes the sea-ice model LIM2 (Fichefet and Morales-Maqueda 1997) and the marine biogeochemistry model PISCES. NEMOv3.2 uses a partial-step formulation (Barnier et al. 2006), which improves the representation of bottom bathymetry and thus of water flow and friction at the bottom of the ocean. Compared to the parametrizations of mixed-layer dynamics used in OPA8, those used in NEMOv3.2 account for double diffusion (Merryfield et al. 1999), Langmuir cells (Axell 2002), and the contribution of surface wave breaking (Mellor and Blumberg 2004; Burchard and Rennau 2008). It also includes a parameterization of bottom-intensified tidally driven mixing, similar to Simmons et al. (2004), in combination with a specific tidal mixing parameterization in the Indonesian area (Koch-Larrouy et al. 2007, 2010). In addition, NEMOv3.2 includes a prognostic interaction between the penetration of incoming shortwave radiation into the ocean and phytoplankton (Lengaine et al. 2009).

2.1.3 CNRM-CM5.1

The CNRM-CM5.1 model (CNRMCM5, Voldoire et al. 2012, this issue) couples atmosphere, land, ocean, and sea-ice components. The atmospheric component is ARPEGE-CLIMATv5.2, which derives from ARPEGE (Action de Recherche Petite Echelle Grande Echelle, i.e., research project on small and large scales, Déqué et al. 1994). ARPEGE-CLIMATv5.2 is a spectral model with a resolution of about 1.4° in longitude and latitude and 31 vertical levels. The land-surface component is SURFEX (SURFace EXternalisée, Manzi and Planton 1994; Noilhan and Mahfouf 1996), which prognostically simulates three types of surfaces: land surfaces, inland water bodies (e.g., lakes), and oceans or seas. Atmospheric chemistry in CNRMCM5 includes only stratospheric ozone chemistry (Cariolle and Teyssèdre 2007).

The ocean component also uses NEMOv3.2 but in the ORCA1 configuration (Hewitt et al. 2011). In ORCA1, a nominal resolution of 1° at the equator is chosen to which a latitudinal grid refinement of 1/3° is added in the tropics. The vertical discretization accounts for 42 levels and also includes partial-steps. Compared to IPSLCM5, this configuration includes an improvement of the Turbulent Kinetic Energy closure scheme (Madec 2008a, b) from Blanke and Delecluse (1993). That is, it allows a fraction of surface wind energy to penetrate below the base of the mixed-layer, improving the coupling between surface wind and mixed-layer depth. The sea-ice model is GELATO (Global Experimental Leads and ice for ATmosphere and Ocean, Salas Mélia 2002). GELATO is a multi-category model. Sea-ice dynamics is represented by the EVP rheology (Hunke and Dukowicz 2012). In CNRMCM5, the interaction between incoming shortwave radiation penetration into the ocean and phytoplankton (Lengaine et al. 2009) is assumed constant (chlorophyll concentration is equal to 0.05 gChl L−1).

2.2 Marine biogeochemistry model: PISCES

PISCES (Pelagic Interaction Scheme for Carbon and Ecosystem Studies) simulates the biogeochemical cycles of oxygen, carbon and nutrients using 24 state variables. Macronutrients (nitrate and ammonium, phosphate, and silicate) and the micronutrient iron limit phytoplankton growth and thus improve the representation of their dynamics (Aumont et al. 2003; Aumont and Bopp 2006).

PISCES has two sizes for phytoplankton (nanophytoplankton and diatoms), for which growth is parametrized using the Eppley et al. (1969) formulation but limited by external availability of nutrients. Diatoms differ from nanophytoplankton because they need silicon and more iron (Sunda and Huntsman 1997) and because they have higher half-saturation constants due to their larger mean size. In PISCES, zooplankton is distinguished by two size-classes: microzooplankton and mesozooplankton. The ratios between carbon, nitrate, and phosphate are considered constant and held to the values proposed by Takahashi et al. (1985) for all plankton classes. However, internal concentrations of iron are simulated prognostically in both phytoplankton classes as is silica in diatoms. Phytoplankton growth depends on external concentrations of nutrients and light availability, while chlorophyll concentration is simulated prognostically following Geider et al. (1998). PISCES also simulates semi-labile dissolved organic matter as well as small and big sinking particles, which differ in their sinking speeds (i.e., 3 and 50–200 m d−1, respectively). For these three compartments, PISCES imposes fixed stoichiometric ratios of O:C:N:P after Takahashi et al. (1985). However, concentrations of iron, silicon, and calcite are simulated prognostically inside the respective pools of sinking particles.

The fate of nanophytoplankton lost through mortality and aggregation depends on the proportion of calcifying organisms. In PISCES, we assume that half of the organic matter of the calcifiers is associated with the shell. Since calcite is denser than organic matter, 50 % of the biomass of dying calcifiers is routed to the fast-sinking particles. The same is assumed for the mortality of diatoms as a consequence of the higher density of biogenic silica. The sinking speed of the particles is not altered by their content of calcite and biogenic silica, while their degradation depends on the local temperature. Remineralization of dissolved organic matter can occur within both oxic and anoxic waters. It depends on the local oxygen concentration and assumes that the degradation rates for oxic respiration and denitrification are identical. Differential sedimentation, turbulent coagulation, and aggregation processes are considered for dissolved and particulate organic matter. Only differential sedimentation is omitted for dissolved organic matter, because its contribution is almost negligible compared to that of turbulence.

In PISCES, inorganic carbon is represented by dissolved inorganic carbon, alkalinity and calcite. Aragonite is not considered for the chemical dissolution of calcium carbonate in the water column. The proportion of calcifying phytoplankton is parametrized following the formulation of Moore et al. (2002), because PISCES does not prognostically represent this phytoplankton functional type. On the other hand, the dissolution rate of calcite follows the parametrization of Gehlen et al. (2007): no dissolution is allowed when waters are supersaturated, but dissolution increases with the level of undersaturation. Total alkalinity in PISCES includes contributions from carbonate, bicarbonate, borate, hydrogen, and hydroxide ions (practical alkalinity). Oxygen is simulated prognostically using two different oxygen-to-carbon ratios, one applying when ammonium is converted to organic matter, the other when oxygen is consumed during nitrification.
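
The saturation-dependent behavior described above for calcite dissolution can be illustrated with a minimal sketch. The power-law form, the function name, and the parameters `k_max` and `order` are assumptions chosen for illustration; they are not the coefficients of Gehlen et al. (2007) or of PISCES.

```python
def calcite_dissolution_rate(omega, k_max=1.0, order=1.0):
    """Illustrative specific dissolution rate (per day) for calcite.

    omega is the calcite saturation state (omega > 1: supersaturated).
    k_max and order are hypothetical placeholders, not PISCES values.
    """
    # No dissolution in supersaturated water; otherwise the rate grows
    # with the degree of undersaturation (1 - omega).
    undersaturation = max(0.0, 1.0 - omega)
    return k_max * undersaturation ** order


if __name__ == "__main__":
    for omega in (1.2, 0.9, 0.5):
        print(omega, calcite_dissolution_rate(omega))
```

The only properties this sketch is meant to convey are the cutoff at saturation and the monotonic increase below it.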

For carbon and oxygen pools, air-sea exchange follows the Wanninkhof (1992) quadratic wind-speed formulation. To ensure conservation of nitrate in the ocean, annual total nitrogen fixation balances denitrification following Lipschultz et al. (1990), Middelburg et al. (1996), and Soetaert et al. (2000). Thus, carbon and nitrogen cycles are partly decoupled.
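
As an illustration of the quadratic wind-speed dependence, the sketch below computes a gas transfer velocity of the Wanninkhof (1992) form, k = a u² (Sc/660)^(−1/2), with k in cm h−1 and u the 10-m wind speed in m s−1. The coefficient a = 0.31 is the value that Wanninkhof (1992) gives for short-term winds (0.39 for long-term averaged winds); which coefficient, Schmidt-number fit, and sea-ice treatment each model uses is not stated here, so the function names and defaults are assumptions.

```python
def gas_transfer_velocity(u10, schmidt, coeff=0.31):
    """Gas transfer velocity in cm/h from a quadratic wind-speed law.

    u10     : wind speed at 10 m (m/s)
    schmidt : Schmidt number of the gas at the local SST (dimensionless)
    coeff   : 0.31 (short-term winds) or 0.39 (long-term averaged winds)
    """
    return coeff * u10 ** 2 * (schmidt / 660.0) ** -0.5


def air_sea_flux(k_cm_per_h, c_surface, c_equilibrium):
    """Sea-to-air flux in mol m-2 s-1 (positive means outgassing).

    c_surface and c_equilibrium are concentrations in mol m-3; the sign
    convention is an assumption made for this sketch.
    """
    k_m_per_s = k_cm_per_h / 100.0 / 3600.0
    return k_m_per_s * (c_surface - c_equilibrium)


if __name__ == "__main__":
    k = gas_transfer_velocity(u10=7.0, schmidt=660.0)
    print(k, air_sea_flux(k, c_surface=2.05e-3, c_equilibrium=2.00e-3))
```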

The geochemical boundary conditions also account for nutrient supply from three different sources: atmospheric dust deposition of iron and silicon (Tegen and Fung 1995; Jickells and Spokes 2001; Moore et al. 2004); rivers for macronutrients, dissolved carbon, and alkalinity (Ludwig et al. 1996); and sediment mobilization for sedimentary iron (Johnson et al. 1999; de Baar and de Jong 2001).

For all nutrients except iron and nitrogen, the external sources are balanced yearly by an equivalent loss from the bottom of the ocean. This loss, which represents geological burial within the sediments, is proportional to the sinking rate. For nitrogen, this loss balances yearly the amount of nitrogen provided by atmospheric deposition and riverine input, since nitrogen fixation is hypothesized to compensate for denitrification at each time-step. Due to the short residence time of iron, its budget is not balanced in PISCES. A fraction of dissolved iron is scavenged by sinking organic matter, and all the particulate iron that reaches the bottom of the ocean is permanently removed from the ocean.

2.3 Simulations

Our results derive from historical simulations that have been obtained in three sequential steps:

  • First, there is the 3000-year spin-up, an off-line integration forced by monthly climatological fields repeated annually and atmospheric CO2 held at 278 ppm, to obtain a quasi-steady state for the preindustrial ocean of each climate model. For each model, monthly climatological fields have been generated from a first preindustrial simulation, in which only dynamical components have been integrated starting from observed climatologies (e.g., Locarnini et al. 2009; Antonov et al. 2009). The methodologies used for integrating IPSLCM4, IPSLCM5, and CNRMCM5 are further detailed in Marti et al. (2010), Dufresne et al. (2012), and Voldoire et al. (2012, this issue), respectively.

  • Second, each model is integrated on-line to allow for interannual variations in biogeochemical tracer fields. These on-line integrations are shorter (100 years for IPSLCM4 and 300 years for IPSLCM5 and CNRMCM5). At the end of this on-line spin-up, the linear drift of the global ocean carbon flux is about 0.013 PgC y−1 per year for IPSLCM4, 0.001 PgC y−1 per year for IPSLCM5 and 0.002 PgC y−1 per year for CNRMCM5. These drifts are comparable to or even smaller than the OCMIP-2 threshold, i.e., 0.01 PgC y−1 per year (Orr 2002). Compared to estimates of preindustrial global air-sea fluxes from forward models (e.g., Aumont et al. 2001) or inverse models (e.g., Mikaloff-Fletcher et al. 2007), IPSLCM4 overestimates the net ocean outgassing, with about 0.8 PgC y−1, while the two CMIP5 models are within the range of model estimates: 0.4 PgC y−1 for CNRMCM5 and 0.3 PgC y−1 for IPSLCM5.

  • Third, there is the transient simulation over the industrial era (beginning in 1860 for IPSLCM4, and 1850 for IPSLCM5 and CNRMCM5).
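
The drift check mentioned in the second step can be sketched as an ordinary least-squares slope fitted to an annual time series of the global carbon flux. The function names and the synthetic series below are illustrative assumptions; only the 0.01 PgC y−1 per year threshold is taken from the text.

```python
OCMIP2_THRESHOLD = 0.01  # PgC y-1 per year, value quoted in the text (Orr 2002)


def linear_drift(years, fluxes):
    """Least-squares slope of flux against time (flux units per year)."""
    n = len(years)
    mean_t = sum(years) / n
    mean_f = sum(fluxes) / n
    sxx = sum((t - mean_t) ** 2 for t in years)
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(years, fluxes))
    return sxy / sxx


def within_threshold(drift, threshold=OCMIP2_THRESHOLD):
    """True when the magnitude of the drift does not exceed the threshold."""
    return abs(drift) <= threshold


if __name__ == "__main__":
    # Synthetic example: a flux drifting by 0.002 PgC y-1 per year.
    years = list(range(100))
    fluxes = [0.3 + 0.002 * t for t in years]
    drift = linear_drift(years, fluxes)
    print(drift, within_threshold(drift))
```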

External greenhouse gas forcing for the historical scenario differs between the CMIP3 model and the two CMIP5 models. As IPSLCM4 is a fully coupled climate-carbon cycle Earth system model, it is forced only by anthropogenic CO2 emissions from fossil fuel burning and land-use change from 1860 to 2005 (Houghton and Marland 2007). IPSLCM5 and CNRMCM5 are forced by the historical anthropogenic CO2 concentrations from 1850 to 2005 (Taylor et al. 2009, 2011). During 1985–2005, the historical anthropogenic CO2 concentration is about 379 ppm for IPSLCM5 and CNRMCM5. Although IPSLCM4 is forced by emissions, its prognostically computed CO2 concentration is close to this value, facilitating comparison among models and with observations. For IPSLCM5 and CNRMCM5, other greenhouse gases (e.g., nitrous oxide, chlorofluorocarbons) and aerosols are also taken into account. In both, five tropospheric aerosol types are considered: sea salt, sulfate, organic, black carbon, and desert aerosols. Volcanic aerosols are specified as a stratospheric aerosol type. These fields were estimated from observed emissions by the LMDZ-INCA chemical model (Szopa et al. 2012).

In parallel, we completed this set of model experiments by making offline simulations of CFC11 and CFC12 using a common passive tracer module (Dutay et al. 2002). This passive tracer module has been forced by monthly circulation fields from the three Earth system models and by observed atmospheric concentrations of CFC11 and CFC12 from 1950 to 2005.

2.4 Assessment methods

To evaluate simulated ocean dynamics, we first relied on data for physical variables clearly tied to the circulation, which controls ocean biogeochemistry. For observed sea surface temperature and sea surface salinity, we used the World Ocean Atlas 2009 (WOA 2009, Locarnini et al. 2009; Antonov et al. 2009), a database that combines observations from 1965 to 2007. For mixed-layer depth we used the observed climatology of de Boyer Montégut et al. (2004, updated in September 2008), in particular their vertical density criterion of 0.03 kg m−3 computed from in situ temperature and salinity collected since 1961. For wind stress, our reference is the NCEP 1957–1996 reanalysis of Kalnay et al. (1996). Simulated shortwave radiation is compared to the global climatology of Gupta et al. (1999), i.e., a monthly climatology from the Earth Radiation Budget Experiment. For assessing modeled ocean-atmosphere net heat fluxes (i.e., the sum of solar radiation, longwave radiation, sensible heat transfer by conduction and convection, and latent heat transfer by evaporation of sea surface water), we use the 1981–2005 reanalysis of Yu et al. (2007), which provides a monthly climatology combining observations from surface meteorological stations and satellite data. Finally, for data-based estimates of the transport of North Atlantic Deep Water, Antarctic Bottom Water, and the Antarctic Circumpolar Current, we relied on estimates from Talley et al. (2003), Ganachaud (2003), and Cunningham (2003).
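
To make the density criterion concrete, the following sketch diagnoses a mixed-layer depth from a single density profile as the depth at which density exceeds its value at a reference depth by 0.03 kg m−3. The 10-m reference depth follows de Boyer Montégut et al. (2004); the linear interpolation between levels, the function names, and the handling of profiles that never reach the threshold are assumptions of this example, not a description of how the climatology or the models compute MLD.

```python
def _interp(x, x0, y0, x1, y1):
    """Linear interpolation of y at x between (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def mixed_layer_depth(depths, sigma, threshold=0.03, ref_depth=10.0):
    """Mixed-layer depth (m) from a density-threshold criterion.

    depths : increasing depths in m
    sigma  : potential density (kg m-3) at those depths
    """
    # Density at the reference depth, interpolated between levels.
    sigma_ref = None
    for i in range(len(depths) - 1):
        if depths[i] <= ref_depth <= depths[i + 1]:
            sigma_ref = _interp(ref_depth, depths[i], sigma[i],
                                depths[i + 1], sigma[i + 1])
            break
    if sigma_ref is None:
        return None  # the profile does not span the reference depth

    # First crossing of (reference density + threshold) below ref_depth.
    target = sigma_ref + threshold
    for i in range(len(depths) - 1):
        if depths[i + 1] <= ref_depth:
            continue
        if sigma[i] < target <= sigma[i + 1]:
            return _interp(target, sigma[i], depths[i],
                           sigma[i + 1], depths[i + 1])
    return depths[-1]  # threshold never reached: mixed to the last level


if __name__ == "__main__":
    z = [0.0, 10.0, 20.0, 30.0, 40.0]
    rho = [25.00, 25.00, 25.01, 25.05, 25.20]
    print(mixed_layer_depth(z, rho))
```

Applying such a function to each month of a climatology and taking the annual maximum and minimum would give fields analogous to the winter and summer MLD discussed in the results.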

As references for simulated biogeochemistry, we also based our analysis on global data products. We use observed climatologies of silicate (SiO2), phosphate (\(\hbox{PO}_{4}^{3-}\)), nitrate (\(\hbox{NO}_{3}^{-}\)), and oxygen (O2) from the World Ocean Atlas 2009 (Garcia et al. 2009a, b). For the concentrations of dissolved inorganic carbon (DIC) and alkalinity (Alk), we used the GLODAP climatology (Key et al. 2004), which is based on discrete measurements collected during 1972–1999 but centered in 1994. For the difference in the partial pressure of CO2 between the ocean and the atmosphere (\(\Updelta p \, \hbox{CO}_{2}\)) we relied on the Takahashi (2009) climatology referenced to year 2000. Simulated surface chlorophyll (CHLa) was compared to the SeaWiFS climatology based on satellite observations from 1997 to 2003 (Carr et al. 2006). Additionally, we used multiple references for data-based estimates of marine primary productivity (Behrenfeld et al. 2006) as well as export of organic carbon, calcite, and silicate (e.g., Laws et al. 2000; Schlitzer 2002; Heinze et al. 2003; Moore et al. 2004; Jin et al. 2006; Dunne et al. 2007).

To provide quantitative assessment of model skill, we used several common statistical metrics (e.g., Stow et al. 2009): (1) correlation coefficient (R) for assessing dominant large-scale patterns between modeled and observed fields; (2) normalized standard deviation for quantifying differences between modeled and observed spatial variations; (3) bias (average error) as a measure of average mismatch between modeled and observed fields; and (4) root mean square error (RMSE) for assessing the total mismatch between modeled and observed fields. All of these computations are performed after regridding model output to the observed grid (2° × 2° for partial pressure of CO2 and 1° × 1° for other variables) and weighting by grid-cell area.
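
The four metrics listed above can be sketched as area-weighted statistics over two fields that already share a grid. This is a minimal illustration under stated assumptions: fields are flattened to lists, missing cells are marked with `None`, and the weight of a regular latitude-longitude cell is taken as the cosine of its latitude. The function names are hypothetical and the regridding step is not shown.

```python
import math


def cell_area_weight(lat_deg):
    """Relative area of a regular lat-lon grid cell centered at lat_deg."""
    return math.cos(math.radians(lat_deg))


def weighted_skill(model, obs, area):
    """Area-weighted R, normalized standard deviation, bias and RMSE."""
    # Keep only cells where both fields are defined.
    cells = [(m, o, a) for m, o, a in zip(model, obs, area)
             if m is not None and o is not None]
    wsum = sum(a for _, _, a in cells)
    mean_m = sum(a * m for m, _, a in cells) / wsum
    mean_o = sum(a * o for _, o, a in cells) / wsum
    var_m = sum(a * (m - mean_m) ** 2 for m, _, a in cells) / wsum
    var_o = sum(a * (o - mean_o) ** 2 for _, o, a in cells) / wsum
    cov = sum(a * (m - mean_m) * (o - mean_o) for m, o, a in cells) / wsum
    mse = sum(a * (m - o) ** 2 for m, o, a in cells) / wsum
    return {
        "R": cov / math.sqrt(var_m * var_o),     # pattern correlation
        "norm_std": math.sqrt(var_m / var_o),    # sigma_model / sigma_obs
        "bias": mean_m - mean_o,                 # average error
        "rmse": math.sqrt(mse),                  # total mismatch
    }


if __name__ == "__main__":
    lats = [-60.0, -20.0, 20.0, 60.0]
    obs = [2.0, 25.0, 26.0, 5.0]
    model = [3.5, 26.0, 27.0, 4.0]
    print(weighted_skill(model, obs, [cell_area_weight(l) for l in lats]))
```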

Statistical metrics (i.e., RMSE, bias, correlation coefficients and standard deviation) are computed from modeled and observed fields at surface, 500, 1,000, 1,500, 2,000, 2,500 and 3,000 m. Correlations and standard deviations are summarized in Taylor diagrams for distributions at the surface, 500, 1,000 and 3,000 m.
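
For reference, the reason correlations and standard deviations can be summarized in a single diagram is a standard identity linking them to the centered root mean square difference. The symbols \(\sigma_{m}\), \(\sigma_{o}\) and \(E'\) (modeled and observed standard deviations, and centered RMS difference) are notation introduced here for illustration; the normalized standard deviation used in this study corresponds to \(\sigma = \sigma_{m}/\sigma_{o}\).

\[
E'^{2} = \sigma_{m}^{2} + \sigma_{o}^{2} - 2\,\sigma_{m}\,\sigma_{o}\,R,
\qquad
\hbox{RMSE}^{2} = \hbox{bias}^{2} + E'^{2}
\]

Dividing the first relation by \(\sigma_{o}^{2}\) gives \((E'/\sigma_{o})^{2} = 1 + \sigma^{2} - 2\,\sigma R\), so that a model with \(R = 1\) and \(\sigma = 1\) has zero centered error, while any remaining RMSE is due to the mean bias alone.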

3 Results

3.1 Ocean physics

3.1.1 Ocean-atmosphere interactions and ocean surface properties

From an atmospheric perspective, we first present the ability of the Earth system models to match the observed net heat flux (Qnet), incoming shortwave radiation (SW), and wind stress (τ). This allows us to assess the surface forcing that the atmosphere exerts on the surface ocean, which controls the dynamics of several ocean fields and the air-sea exchange of O2 and CO2 (e.g., Sarmiento and Gruber 2006). From an oceanic perspective, we show the annual maximum of mixed-layer depth (MLD, reached during winter) as well as its annual minimum (reached during summer). The maximum and minimum MLD are commonly used to diagnose how well a physical model is suited to simulating ocean biogeochemistry (e.g., Doney 2004; Schneider et al. 2008). The maximum MLD controls nutrient supply to the photic zone, while the minimum MLD controls light limitation (e.g., Sarmiento and Gruber 2006; Behrenfeld 2010).

Figure 1a shows that for both τ and MLD, correlations between simulated and data-based annual-mean fields range between 0.5 and 0.8. Generally, correlations for both IPSLCM5 and CNRMCM5 are higher than those for IPSLCM4; normalized standard deviations (σ) range between 0.4 and 1.9. Models broadly replicate annual-mean Qnet (R > 0.7 and σ ~ 0.9). Including the annual cycle in the analysis decreases the correlation between modeled and observed fields. This appears clearly for Qnet, whose spatiotemporal variations are poorly represented by the models (σ ~ 0.3 and R ~ 0.3). In contrast, spatiotemporal variations of modeled SW are better reproduced when taking into account the annual cycle.

Fig. 1 Taylor diagram showing the correspondence between model results and observations for (a) shortwave radiation (SW, Gupta et al. 1999), net heat flux (Qnet, Yu et al. 2007), wind stress (τ, Kalnay et al. 1996) and mixed-layer depth (MLDmean, de Boyer Montégut et al. 2004) and (b) sea surface temperature (SST, Locarnini et al. 2009), minimum (summer) mixed-layer depth (MLDmin), maximum (winter) mixed-layer depth (MLDmax) and sea surface salinity (SSS, Antonov et al. 2009). Large symbols indicate statistics from annual-mean distributions, while small symbols indicate that computations include the annual cycle

All three models exhibit a reasonable large-scale structure of annual-mean sea surface temperature (SST) and sea surface salinity (SSS) (Fig. 1b). Models generally reproduce observed SST (Locarnini et al. 2009) with correlation coefficients above 0.95 and normalized standard deviations close to 1. The same generally holds for SST when we include the annual cycle. The largest differences between modeled and observed SST are found in the Equatorial Pacific Ocean and the Southern Ocean (supplementary materials, Figure S1). In the Equatorial Pacific, IPSLCM4 and IPSLCM5 exhibit higher than observed SST, while SST is slightly colder in CNRMCM5. The opposite happens in the Southern Ocean. As for SST, the annual-mean fields and annual cycle of SSS agree well with the observations (R > 0.95 and σ ~ 1). The largest biases between modeled and observed SSS are found in high-latitude oceans, especially in the Southern Ocean.

Figure 2 reveals that models broadly capture the general zonal-average structure of maximum MLD (MLDmax), but still exhibit substantial differences compared to the observed climatology (de Boyer Montégut et al. 2004): (1) in the northern oceans, IPSLCM4 and IPSLCM5 fail to simulate the deep MLDmax in the Labrador Sea, an important deep-convection site; (2) in the tropical ocean, the MLDmax in IPSLCM4 and IPSLCM5 roughly agrees with observed values, but CNRMCM5 overestimates it by 20–40 m; (3) Southern Ocean (<30°S) MLDmax is poorly simulated by the models, all of which fail to simulate the deep MLDmax that is observed in the Pacific sector; and (4) in western boundary currents (e.g., Gulf Stream, Kuroshio, Malvinas-Brazil confluence), MLDmax is overestimated in all models, particularly CNRMCM5. Figure 2 also shows large discrepancies between modeled and observed minimum MLD (MLDmin). The MLDmin is better simulated in CNRMCM5 than in IPSLCM4 and IPSLCM5, because CNRMCM5 captures the deep MLDmin in the Southern Ocean. Although it remains too shallow, simulated MLDmin in the Southern Ocean improves with the evolution from IPSLCM4 to IPSLCM5 (Fig. 2). In the Equatorial Pacific, MLDmin is overestimated in all models, particularly CNRMCM5.

Fig. 2 Comparison between modeled and observed maximum (MLDmax) and minimum (MLDmin) mixed-layer depth (in m). The observed climatology is an update of the climatology of de Boyer Montégut et al. (2004), which incorporates Argo data until September 2009

3.1.2 Water mass transport and general circulation

Table 2 shows that the magnitude of simulated NADW transport is low relative to observations in IPSLCM4, while those of IPSLCM5 and CNRMCM5 are within the low end of the observational uncertainty (e.g., Talley et al. 2003; Ganachaud 2003). The northward transport of AABW is underestimated in IPSLCM5 and CNRMCM5 relative to observational estimates (e.g., Talley et al. 2003). All models underestimate the observed strength of the ACC (e.g., Talley et al. 2003; Cunningham 2003), although simulated transport in IPSLCM5 and CNRMCM5 is closer to the observed transport than that of IPSLCM4, which is much lower (i.e., 60–70 Sv lower).

Table 2 Estimates from observations (e.g., Talley et al. 2003; Cunningham 2003) and models of the Atlantic meridional overturning circulation (AMOC, maximum of the meridional streamfunction between 300 and 2,000 m and 0–75°N), Antarctic Bottom Water export (AABW, minimum of the meridional streamfunction between 2,000 m and the bottom of the ocean), and Antarctic Circumpolar Current (ACC, maximum of the barotropic streamfunction at Drake Passage), in Sverdrup (1 Sv = \(10^{6}\) m3 s−1)
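
The transport indices defined in the caption of Table 2 amount to taking an extremum of a streamfunction over a depth and latitude window. The sketch below illustrates this for the AMOC and AABW indices, assuming a meridional streamfunction already expressed in Sv on a depth-by-latitude grid; the function names, the array layout, and the small example grid are hypothetical.

```python
def amoc_index(psi, depths, lats,
               zmin=300.0, zmax=2000.0, latmin=0.0, latmax=75.0):
    """Maximum of psi[k][j] (Sv) for depths[k] in [zmin, zmax] and
    lats[j] in [latmin, latmax]."""
    return max(psi[k][j]
               for k, z in enumerate(depths) if zmin <= z <= zmax
               for j, lat in enumerate(lats) if latmin <= lat <= latmax)


def aabw_index(psi, depths, zmin=2000.0):
    """Minimum of psi[k][j] (Sv) from zmin down to the deepest level."""
    return min(value
               for k, z in enumerate(depths) if z >= zmin
               for value in psi[k])


if __name__ == "__main__":
    depths = [100.0, 500.0, 1000.0, 3000.0]
    lats = [-30.0, 10.0, 40.0, 80.0]
    psi = [[30.0, 5.0, 6.0, 7.0],
           [1.0, 12.0, 15.0, 40.0],
           [2.0, 10.0, 17.0, 3.0],
           [-9.0, -4.0, -2.0, 0.0]]
    print(amoc_index(psi, depths, lats), aabw_index(psi, depths))
```

Restricting the search window is what keeps shallow wind-driven cells and values outside the North Atlantic from being mistaken for the overturning maximum.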

3.2 CFC11 and CFC12 distribution

CFC11 is an inert, abiotic, and entirely man-made tracer in seawater, whose distribution is set solely by ocean dynamics and uptake at the air-sea interface, not by biological activity. Modeled CFC11 in surface waters is generally quite similar among models, being essentially set by atmospheric CFC11 and SST patterns. However, modeled CFC11 in subsurface waters and CFC11 inventories can differ greatly, providing information on ocean dynamics (e.g., Matsumoto et al. 2004). Global inventories of CFC11 are about \(4.3 \times 10^{8}\), \(6.3 \times 10^{8}\), and \(6.4 \times 10^{8}\) mol for CNRMCM5, IPSLCM4 and IPSLCM5, respectively. The models thus bracket the observed estimate of \(5.2 \times 10^{8}\) mol. The simulated and observed CFC11 inventories (GLODAP, Key et al. 2004) shown in Fig. 3 illustrate how the largest discrepancies occur in the Southern Ocean. In this region, models manage to capture the location of the subduction zone of mode waters (~40–50°S), especially IPSLCM5 and CNRMCM5. However, south of 50°S, models exhibit large differences in their CFC11 inventories. Compared to the observations, the simulated CFC11 inventories are low in CNRMCM5 and high in IPSLCM4. These biases, which occur in deep-convection zones, reveal strong differences in the models' ability to ventilate subsurface waters. In particular, Fig. 3 indicates that models have trouble replicating the convection zones in the Weddell Sea for IPSLCM4 and in the North Atlantic for IPSLCM5 and CNRMCM5.

Fig. 3 Annual mean of CFC11 inventory (in nmol m−2) and zonal integral inventory (in mol m−2) for IPSLCM4, IPSLCM5 and CNRMCM5. The observed climatology is based on the GLODAP climatology. Model outputs and observations use 1994 as reference year (Key et al. 2004)

3.3 Marine biogeochemistry

3.3.1 Surface properties

3.3.1.1 Annual-mean macronutrient distributions

The three models generally capture the large-scale structure of annual-mean surface concentrations of macronutrients (Table 3). Correlation coefficients between modeled and observed surface concentrations of macronutrients are high (R ~ 0.9), except for SiO2 in IPSLCM4 (R ~ 0.6). Global average bias and RMSE reveal that models simulate surface macronutrient concentrations skillfully (Table 3). Only \(\hbox{NO}_{3}^{-}\) surface concentrations in CNRMCM5 are strongly overestimated, by about 5 μmol L−1 (~16 % of the global mean).

Table 3 Area-weighted correlation coefficients (R), root mean square error (RMSE) and bias/average error (AE) of annual-mean surface concentrations of phosphate, nitrate, silicate, chlorophyll and \(\Updelta p \hbox{CO}_{2}\), computed between model results and observational data (Carr et al. 2006; Takahashi 2009; Garcia et al. 2009a, b, for chlorophyll, \(\Updelta p \hbox{CO}_{2}\), and nutrients, respectively)

All models generally capture the observed dominant surface patterns for \(\hbox{PO}_{4}^{3-}\) and SiO2; for \(\hbox{NO}_{3}^{-}\), IPSLCM4 and IPSLCM5 similarly agree with overall observed patterns but CNRMCM5 does not (Fig. 4). In the tropics, models tend to underestimate the high surface concentrations that are associated with the strong Equatorial Pacific upwelling (Fig. 4). In the high latitudes, all models underestimate observed zonal-mean concentrations between 40°N and 60°N, mostly because surface concentrations in the North Pacific are too low. In the Southern Ocean, models diverge in their ability to properly simulate surface macronutrients. Only IPSLCM5 reproduces the high surface SiO2 concentrations observed close to the Antarctic shelf. Zonal averages illustrate that simulated gradients in macronutrients in CNRMCM5 are shifted southward relative to those for IPSLCM4 and IPSLCM5 (Fig. 4), a feature that is related to the position of maximum wind stress, stemming from different atmospheric components of the coupled models (see Sect. 4.2).

Fig. 4 Annual mean of the seasonal climatology of surface concentrations (in μmol L−1) of phosphate (\(\hbox{PO}_{4}^{3-}\)), nitrate (\(\hbox{NO}_{3}^{-}\)), and silicate (SiO2) over the 1985–2005 period for IPSLCM4, IPSLCM5 and CNRMCM5. The observed climatology is based on the World Ocean Atlas 2009 (Garcia et al. 2009a, b)

3.3.1.2 Annual-mean surface CHLa

CHLa provides evidence of nutrient and light limitation controlled by ocean mixed-layer dynamics (e.g., Behrenfeld 2005, 2010; Behrenfeld et al. 2006). Because it has been monitored at the global scale for several decades (e.g., Siegel et al. 2002; Carr et al. 2006), CHLa is a useful indicator for assessing the skill of ocean biogeochemical models.

Generally, models roughly agree with large-scale observed patterns (R ~ 0.4, Table 3). This is especially the case for high CHLa surface concentrations associated with western boundary currents (Fig. 5). However, models systematically underestimate CHLa concentration in the oligotrophic gyres of the subtropical oceans and in upwelling regions such as the Eastern Equatorial Pacific. Yet, in these regions, SeaWiFS may detect the deep CHLa maximum (e.g., Garcia et al. 2005). Models differ most in the Southern Ocean, where IPSLCM4 and IPSLCM5 simulate values between 0.3 and 0.8 μgChl L−1, much higher than those estimated by satellite (0.2 μgChl L−1). However, the models fall within the observational uncertainties, given that SeaWiFS underestimates CHLa surface concentration in the Southern Ocean (e.g., Garcia et al. 2005).

Fig. 5 Annual mean of the seasonal climatology of surface chlorophyll (CHLa, in μgChl L−1) concentrations over the 1985–2005 period for IPSLCM4, IPSLCM5 and CNRMCM5. The observed climatology is based on the SeaWiFS satellite product (Carr et al. 2006)

3.3.1.3 Annual-mean \(\Updelta p \hbox{CO}_{2}\)

A global data product for the ocean minus atmosphere difference in the partial pressure of carbon dioxide (pCO2), known as \(\Updelta p \hbox{CO}_{2},\) has been computed from millions of discrete measurements of oceanic pCO2 (Takahashi et al. 2002, 2006; Takahashi 2009). \(\Updelta p \hbox{CO}_{2}\) drives the air-sea CO2 flux and is affected by solubility (a function of SST, SSS, DIC and Alk), ocean transport, and biological activity (photosynthesis and respiration).
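For reference, the link between \(\Updelta p \hbox{CO}_{2}\) and the air-sea flux can be written with the standard bulk formulation. The symbols \(F\), \(k\) and \(K_{0}\) are introduced here for illustration only and the exact gas-exchange formulation of each model may differ:

\[
F = k \, K_{0} \, \left( p\hbox{CO}_{2}^{\rm ocean} - p\hbox{CO}_{2}^{\rm atm} \right) = k \, K_{0} \, \Updelta p \hbox{CO}_{2},
\]

where \(k\) is the gas transfer velocity (commonly parametrized as a function of wind speed) and \(K_{0}\) is the solubility of CO2 in seawater. With this sign convention, positive \(\Updelta p \hbox{CO}_{2}\) corresponds to outgassing and negative \(\Updelta p \hbox{CO}_{2}\) to ocean uptake.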

Compared to the climatology of Takahashi (2009), models broadly capture large-scale structures of annual-mean \(\Updelta p \hbox{CO}_{2}\) with correlation coefficients of about 0.6 ± 0.1 (Table 3). However, globally averaged RMSE values range between 16 and 24 ppm, in contrast with globally averaged bias values, which range between −2 and −16 ppm. The smaller magnitude of the bias reflects regional biases that tend to cancel each other. Regional model-data discrepancies shown in Fig. 6 illustrate that the largest differences occur in three main areas: (1) in the northern oceans, \(\Updelta p \hbox{CO}_{2}\) in IPSLCM4 and IPSLCM5 is much stronger than observed, whereas CNRMCM5’s \(\Updelta p \hbox{CO}_{2}\) is roughly on target; (2) in the Equatorial Pacific, the high observed \(\Updelta p \hbox{CO}_{2}\) is underestimated by IPSLCM4, whose Equatorial tongue is too confined toward the eastern part of the basin (conversely, IPSLCM5 and CNRMCM5 exhibit excessive outgassing); and (3) in the Southern Ocean, models poorly represent the observed structure of slight outgassing over large areas. Nevertheless, the Southern Ocean data may be biased by seasonal sampling: most measurements have been taken during summer.

Fig. 6 Annual mean of the seasonal climatology of surface \(\Updelta p \hbox{CO}_2\) (in ppm) over the 1985–2005 period for IPSLCM4, IPSLCM5 and CNRMCM5. The observed climatology is based on Takahashi (2009)

3.3.2 Vertical distributions

Models broadly capture the observed vertical gradients in macronutrients between the surface and 1,000 m, but they have trouble reproducing the observed subsurface maxima in \(\hbox{NO}_{3}^{-}\) and \(\hbox{PO}_{4}^{3-}\) (Fig. 7, see also supplementary materials Figures S2 and S3).

Fig. 7 Vertical profiles of mean concentration, root mean square error (RMSE), bias and correlation (R) of annual-mean phosphate (\(\hbox{PO}_{4}^{3-}\)), nitrate (\(\hbox{NO}_{3}^{-}\)), and silicate (SiO2) concentrations at surface, 500, 1,000, 1,500, 2,000, 2,500 and 3,000 m for IPSLCM4 (in green), IPSLCM5 (in red) and CNRMCM5 (in blue). Circles represent depths used in Figs. S3 and S5. Observed profiles of mean concentration of phosphate, nitrate, and silicate (Garcia et al. 2009b) are represented by thin black lines

For both these macronutrients, the fact that models underestimate concentrations between 500 and 1,000 m while overestimating them between 1,500 and 2,000 m indicates that models poorly represent the remineralization depth. This appears clearly for all of the models in the RMSE profiles, which have their largest values between 500 and 1,000 m, except for SiO2, where RMSE is larger between 1,500 and 2,000 m, but to a lesser degree. In CNRMCM5, deep \(\hbox{NO}_{3}^{-}\) concentrations are too high, worsening with proximity to the Southern Ocean, indicating a problem with the ventilation of deep waters in that model. Simulated concentrations of \(\hbox{PO}_{4}^{3-}\) and SiO2 in all three models are similar to those observed, especially for \(\hbox{PO}_{4}^{3-}\), with correlation coefficients always above 0.8 and normalized standard deviations close to 1 (Figs. 7 and S2). But in CNRMCM5, both of these metrics decline sharply with depth for \(\hbox{NO}_{3}^{-}\).

Fields of O2, dissolved inorganic carbon (DIC), and total alkalinity (Alk) are of particular interest for assessing the skill of ocean biogeochemistry models in simulating air-sea fluxes. Unlike other tracers, surface O2 and DIC are affected by air-sea gas exchange; below the surface, however, they behave like other tracers, being affected by ocean circulation and interior sources minus sinks.

Large model-model and model-data differences are found for vertical profiles of DIC, Alk and O2, as illustrated by profiles of global average bias and RMSE (Fig. 8). IPSLCM4 simulates the most realistic profiles for O2 and DIC (supplementary materials Figures S4 and S5). IPSLCM5 strays further from observed DIC and O2 below 1,500 m, but its deep Alk profile shows substantial improvement. CNRMCM5 shows the largest disagreement for profiles of DIC and O2, whereas for Alk, its skill is as good as or better than that of the other models. Despite these strong biases in DIC and O2 profiles for IPSLCM5 and CNRMCM5, their correlation coefficients are larger than for IPSLCM4.

Fig. 8 Vertical profiles of mean concentration (MEAN), root mean square error (RMSE), bias and correlation (R) of annual-mean dissolved inorganic carbon (DIC), alkalinity (Alk) and oxygen (O2) concentrations at surface, 100, 500, 1,000, 1,500, 2,000, 2,500 and 3,000 m for IPSLCM4 (green), IPSLCM5 (red) and CNRMCM5 (blue). Circles represent depths used in Figs. S3 and S5. Observed profiles of mean concentration of DIC and Alk (Key et al. 2004) and of O2 (Garcia et al. 2009a) are represented by thin black lines

3.4 Air-sea fluxes and biological production

Models differ substantially in their simulated air-sea CO2 fluxes, primary productivity, export production at 100 m and attenuation length (Table 4). For primary productivity, the models’ global estimates mirror differences in Equatorial Pacific primary productivity. Otherwise, global model estimates span the range between the low values of CNRMCM5 and the high values of the IPSL models. Yet our simulated values are within the range of published estimates based on observations (e.g., Laws et al. 2000), observations with inverse models (e.g., Schlitzer 2002), and forward models (e.g., Heinze et al. 2003; Moore et al. 2004; Jin et al. 2006; Dunne et al. 2007).

Table 4 Modeled and observation-based (i.e., Behrenfeld et al. 2006; Laws et al. 2000; Schlitzer 2002; Heinze et al. 2003; Moore et al. 2004; Jin et al. 2006; Dunne et al. 2007) annual-mean fluxes of matter at the ocean-atmosphere interface and fluxes of sinking particles toward the deep ocean

However, models exhibit large differences at depth compared to the previous estimates (Table 4). Globally, models underestimate the depth at which waters are saturated with respect to the calcium carbonate mineral calcite, the calcite saturation horizon depth (CSH). This underestimation of the CSH depth is strong in IPSLCM4 and IPSLCM5 and weaker in CNRMCM5. In contrast, Table 4 shows that the remineralization length (\(Z_{\rm remin}\)) is overestimated in all three models compared to the estimate based on the power law of Martin et al. (1987). Note that we define \(Z_{\rm remin}\) as the depth at which 10 % of the export of organic matter at 100 m remains, following the power law of Martin et al. (1987).
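This definition can be made explicit with the usual form of the Martin et al. (1987) power law. The symbols \(F\), \(z_{0}\) and \(b\) are notation introduced here for illustration:

\[
F(z) = F(z_{0}) \left( \frac{z}{z_{0}} \right)^{-b}, \qquad z_{0} = 100~\hbox{m},
\]

so that setting \(F(Z_{\rm remin}) / F(z_{0}) = 0.1\) gives

\[
Z_{\rm remin} = z_{0} \times 10^{1/b}.
\]

For the exponent \(b = 0.858\) quoted in Sect. 4.1, this expression yields \(Z_{\rm remin}\) of roughly 1,460 m. A smaller exponent gives a deeper \(Z_{\rm remin}\). This closed form is only a sketch of the definition and need not reproduce the values reported in Table 4.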

4 Discussion

4.1 Model skill

The three Earth system models studied here exhibit reasonable skill in simulating SST and SSS, similar to other state-of-the-art Earth system models (e.g., Meehl et al. 2007; Randall et al. 2007). Although IPSLCM5 and CNRMCM5 are more skillful than IPSLCM4 in simulating the dynamics of the upper ocean, all models have trouble adequately simulating mixed-layer depths, both the summer minimum and winter maximum, as well as wind stress and the dominant characteristics of the large-scale geostrophic circulation (e.g., intensity and position of the ACC). These physical deficiencies degrade the ability of the models to simulate biogeochemical fields, both in the upper ocean and at depth. Deep distributions of biogeochemical tracers reveal signatures of unrealistic simulated ocean dynamics and ventilation of the deep waters over decadal to centennial timescales (supplementary materials Figures S2 to S5).

Although the three models differ in multiple ways, this exercise helps summarize how developments in the IPSL model have altered simulated ocean biogeochemical fields between IPCC AR4 and AR5. The major differences between IPSLCM4 and IPSLCM5 are as follows: (1) surface macronutrients have improved, as illustrated by lower biases and higher correlation coefficients, while large-scale structures of surface SiO2, CHLa, and \(\Updelta p \hbox{CO}_{2}\) are also closer to those observed; (2) vertical profiles of alkalinity have improved; and (3) deep-ocean ventilation has become weaker, which in turn has degraded deep distributions of biogeochemical tracers, especially O2. Relative to the two IPSL models, CNRMCM5 has larger biases and RMSE in \(\hbox{NO}_{3}^{-}\) and O2, attributable to sluggish deep-ocean ventilation. Yet for other biogeochemical fields, CNRMCM5 results are comparable.

Despite model differences in biogeochemical fields, a model’s chemical potential to take up atmospheric CO2 is ultimately determined by alkalinity minus DIC (Alk-DIC). That difference generally serves as a good approximation for the carbonate ion concentration \([\hbox{CO}_{3}^{2-}]\) (Sarmiento and Gruber 2006). In turn, \([\hbox{CO}_{3}^{2-}]\) is inversely related to the Revelle factor, making it a proxy for the buffer capacity of seawater and hence the chemical capacity for the ocean to take up atmospheric CO2. This aspect is interesting because Earth system models are used in the IPCC AR5 (CMIP5) framework to estimate compatible emissions and climate-carbon cycle feedbacks (e.g., Dufresne et al. 2012) from simulations with future Representative Concentration Pathways (RCP, Taylor et al. 2009, 2011).
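The approximation follows from the standard simplified carbonate system, neglecting dissolved CO2 in DIC and the borate and minor contributions to alkalinity. This short derivation is a textbook sketch included for illustration:

\[
\hbox{DIC} \approx [\hbox{HCO}_{3}^{-}] + [\hbox{CO}_{3}^{2-}], \qquad
\hbox{Alk} \approx [\hbox{HCO}_{3}^{-}] + 2\,[\hbox{CO}_{3}^{2-}],
\]

so that

\[
\hbox{Alk} - \hbox{DIC} \approx [\hbox{CO}_{3}^{2-}].
\]

Positive anomalies of Alk-DIC therefore indicate waters with more carbonate ion, and hence a greater buffer capacity, than observed.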

All models simulate large-scale structures of Alk-DIC in the upper ocean that are similar to those observed (Fig. 9). Model-data differences reveal that simulated Alk-DIC is generally too low in the high latitudes, particularly the Southern Ocean (<30°S). The anomalies of Alk-DIC indicate that all three models fail to replicate the chemical properties of NADW and AABW. The simulated NADW, characterized by positive Alk-DIC anomalies, contains much more \(\hbox{CO}_{3}^{2-}\) than expected from observation-based estimates. The opposite happens for the simulated AABW. The fact that deep waters overall contain much more DIC than observed seems to be related to the coarse resolution of the models (Orr 2002). In the North Atlantic especially, the vertical structure of Alk-DIC anomalies shows that NADW export in the three models is shallower than estimated from observations, especially in IPSLCM4 and IPSLCM5.

Fig. 9 Zonal-average annual-mean difference between alkalinity and dissolved inorganic carbon (Alk-DIC, in μmol L−1) for the GLODAP data-based estimate (Key et al. 2004), IPSLCM4, IPSLCM5, and CNRMCM5. Bottom panels represent zonal-average anomalies of Alk-DIC (Models-GLODAP). Thin solid lines are positive anomalies, while thin dashed lines are negative. Vertical profiles represent basin averages for the Atlantic and Pacific Oceans

Although all three models present large discrepancies in ocean circulation, we cannot attribute biases in tracer distributions solely to ocean dynamics. Our analysis suggests that the remineralization length and calcite saturation horizon are systematically poorly represented in all three models (Table 4). The mean Martin exponent (Martin et al. 1987) estimated from model output is about 0.79 for IPSLCM4, 0.75 for IPSLCM5 and 0.65 for CNRMCM5, while the expected Martin exponent is close to 0.858. Because a smaller exponent implies slower attenuation of the sinking flux with depth, a large fraction of the sinking organic matter is remineralized at depth, inducing bias in the simulated nutricline (Table 4, Figures S2 and S3). These points highlight that the representation of the biological pump should be improved in order to better simulate the nutricline, carbocline and oxycline.
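The effect of the exponent on flux attenuation can be sketched numerically. The snippet below assumes the standard form of the Martin et al. (1987) power law with a 100 m reference depth and uses the exponents quoted above; the function names are hypothetical, and the resulting depths are illustrative and need not match the values of Table 4.

```python
def martin_flux_fraction(z, b, z0=100.0):
    """Fraction of the export flux at z0 (m) that remains at depth z (m)."""
    return (z / z0) ** (-b)


def remineralization_depth(b, fraction=0.10, z0=100.0):
    """Depth (m) at which `fraction` of the export flux at z0 remains.

    Obtained by inverting the power law: z = z0 * fraction ** (-1 / b).
    """
    return z0 * fraction ** (-1.0 / b)


# Exponents quoted in the text; the reference value is the expected one.
exponents = {
    "expected (Martin et al. 1987)": 0.858,
    "IPSLCM4": 0.79,
    "IPSLCM5": 0.75,
    "CNRMCM5": 0.65,
}

for name, b in exponents.items():
    z_remin = remineralization_depth(b)
    at_1000 = martin_flux_fraction(1000.0, b)
    print(f"{name}: Z_remin ~ {z_remin:.0f} m, "
          f"fraction of export left at 1000 m = {at_1000:.2f}")
```

In this sketch, the smaller the exponent, the deeper the 10 % horizon, which is consistent with the overestimated remineralization length in all three models.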

4.2 Mixed-layer depth

The mixed-layer depth is more realistic in IPSLCM5 and CNRMCM5 than in the older IPSLCM4, particularly in summer. Hence the two AR5-era models provide better dynamics for phytoplankton growth in the light-limited high latitudes. Light limitation occurs when the mixed layer is deeper than the critical depth. A summer mixed layer that is shallower than observed can induce excess simulated productivity, particularly in regions where nutrients are replenished by ocean dynamics (e.g., Sarmiento et al. 2004; Behrenfeld 2005, 2010; Behrenfeld et al. 2006). Thus biases in surface CHLa often indicate problems with a model’s local representation of light-to-nutrient limitation.

Although SST is known to control phytoplankton growth and thus surface CHLa concentrations, the correlation between simulated SST and CHLa (R ~ 0.4) is lower than the correlation between simulated mixed-layer depth and CHLa (R > 0.7) in the high latitudes. In particular, mismatches between observed and simulated CHLa concentrations in IPSLCM4 and IPSLCM5 in the Southern Ocean are linked to the mixed layer (Fig. 10). The summer mixed layer is shallower than observed in both IPSLCM4 and IPSLCM5. These same models overestimate the prevalence of low surface CHLa concentrations (i.e., 0.12–0.5 μgChl L−1) because of the light-to-nutrient limitation. In IPSLCM5, the summer mixed-layer depth has improved relative to IPSLCM4 (Figs. 3, 10), but surface CHLa remains higher than observed (Figs. 8, 10) due to the improved nutrient supply (see Sect. 4.3).

Fig. 10 Scatter plot of surface chlorophyll (μgChl L−1, top panel) and wind stress (in Pa, bottom panel) versus summer mixed-layer depth (m) for observations (black), IPSLCM4 (green), IPSLCM5 (red), and CNRMCM5 (blue) over the Southern Ocean (<30°S). For the observations, we use the mixed-layer depth climatology (de Boyer Montégut et al. 2004), the NCEP wind stress reanalysis (Kalnay et al. 1996) and the SeaWiFS chlorophyll climatology (Carr et al. 2006). Thin black lines represent the linear relationship between surface chlorophyll or wind stress and summer mixed-layer depth. Summer mixed-layer depths typically do not exceed 100 m; deep observed summer mixed-layer depths are due to the relatively low density difference used to define mixed-layer depth

The modified mixed-layer parametrization used in CNRMCM5 leads to a better representation of phytoplankton dynamics in the high latitudes. In the Southern Ocean, the simulated area of the light-limited zone is 150 × 10^6 km2 in CNRMCM5, while it is 74 × 10^6 and 91 × 10^6 km2 in IPSLCM4 and IPSLCM5, respectively. The deeper summer mixed layer due to wind energy means that phytoplankton become more light limited. Yet this coupling between surface winds and mixed-layer depth amplifies the effects of atmospheric biases. Thus, in the Equatorial Pacific, CNRMCM5 overestimates the depth of the summer mixed layer, which results in less primary productivity, thereby reducing its estimate of global marine production (Table 4), given that the Equatorial Pacific is one of the most productive regions of the ocean (e.g., Behrenfeld et al. 2006).
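The area of a light-limited zone could be diagnosed as sketched below, following the criterion stated above that light limitation occurs where the mixed layer is deeper than the critical depth. The sketch assumes a regular latitude-longitude grid and a critical-depth field supplied by the user; the function names and the way the critical depth is obtained are assumptions, not the diagnostic actually used here.

```python
import numpy as np

EARTH_RADIUS_M = 6.371e6  # mean Earth radius


def cell_areas_km2(lat_deg, dlat_deg, dlon_deg, nlon):
    """Cell areas (km2) of a regular lat-lon grid, shape (nlat, nlon)."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    dlat = np.deg2rad(dlat_deg)
    dlon = np.deg2rad(dlon_deg)
    # Exact area of a spherical cell: R^2 * dlon * (sin(north) - sin(south)).
    band = np.sin(lat + dlat / 2.0) - np.sin(lat - dlat / 2.0)
    area = EARTH_RADIUS_M ** 2 * dlon * band / 1.0e6
    return np.repeat(area[:, None], nlon, axis=1)


def light_limited_area_km2(mld, critical_depth, area_km2):
    """Total area (km2) where the mixed layer is deeper than the critical depth.

    Cells with NaN (land or missing data) compare as False and are skipped.
    """
    limited = np.asarray(mld, dtype=float) > np.asarray(critical_depth, dtype=float)
    return float(np.sum(np.where(limited, area_km2, 0.0)))
```

In practice, `mld` would be the summer mixed-layer depth and the sum would be restricted to the Southern Ocean by masking the other cells with NaN.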

4.3 Surface winds

Our results suggest that surface wind patterns and large-scale geostrophic circulation are better in IPSLCM5 and CNRMCM5 than in IPSLCM4. Surface winds affect ocean dynamics through Ekman transport, which in turn affects the distributions of tracers in the upper ocean. But the strength of nutrient supply from Ekman-driven upwelling is difficult to diagnose. As soon as nutrient-rich waters upwell, nutrients brought into the euphotic zone are rapidly consumed by the biota and advected by the upper-ocean circulation (Sarmiento et al. 2004).
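For reference, the textbook relations linking wind stress to Ekman transport and Ekman pumping are recalled here. The symbols are introduced for illustration and these expressions were not necessarily used as diagnostics in this study:

\[
\mathbf{M}_{E} = \frac{1}{\rho_{0} f} \left( \tau_{y},\; -\tau_{x} \right), \qquad
w_{E} = \frac{1}{\rho_{0}} \, \hat{\mathbf{k}} \cdot \nabla \times \left( \frac{\boldsymbol{\tau}}{f} \right),
\]

where \(\boldsymbol{\tau} = (\tau_{x}, \tau_{y})\) is the surface wind stress, \(\rho_{0}\) a reference seawater density, \(f\) the Coriolis parameter, \(\mathbf{M}_{E}\) the Ekman volume transport per unit width and \(w_{E}\) the vertical velocity at the base of the Ekman layer. In the Southern Hemisphere (\(f < 0\)), westerly winds (\(\tau_{x} > 0\)) thus drive a northward Ekman transport, and a shift in the latitude of maximum wind stress shifts the associated upwelling and nutrient supply.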

Hourdin et al. (2012) showed that in climate models, the horizontal resolution of the atmospheric component affects the location of westerly winds and hence the strength of the Antarctic Circumpolar Current at Drake Passage. Thus atmospheric model resolution affects simulated ocean Ekman transport and thus nutrient supply. Furthermore, our results suggest a link between the location of the maximum (winter) zonally averaged wind stress (a proxy for westerly winds) and the maximum of the zonal inventory of annual-mean macronutrients in the upper 500 m (Fig. 11). Although CNRMCM5 has a different resolution in its ocean component, we found that its locations of maximum zonal inventory of annual-mean macronutrients and of maximum zonally averaged wind stress are also shifted southward. Generally then, improving surface winds should a priori lead to more realistic distributions of biogeochemical tracers.

Fig. 11 Scatter plot of latitudinal locations of maximum (winter) wind stress and maximum zonal macronutrient inventory between 0 and 500 m for observations, IPSLCM4, IPSLCM5, and CNRMCM5. Observations are from the wind stress reanalysis of NCEP (Kalnay et al. 1996) and the nitrate climatology of WOA 2009 (Locarnini et al. 2009)
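The latitudes compared in Fig. 11 could be extracted along the following lines. This is a minimal sketch that assumes a one-dimensional zonal profile (a zonal mean for wind stress, a zonal integral for the nutrient inventory) and a Southern Ocean search band of 70°S to 30°S; the function name and the band limits are assumptions.

```python
import numpy as np


def latitude_of_maximum(zonal_profile, lat_deg, lat_min=-70.0, lat_max=-30.0):
    """Latitude (degrees) at which a zonal profile peaks within a band.

    zonal_profile : 1-D array, e.g. zonally averaged winter wind stress or
                    the zonal inventory of a macronutrient over 0-500 m.
    lat_deg       : 1-D array of latitudes, same length as zonal_profile.
    """
    profile = np.asarray(zonal_profile, dtype=float)
    lat = np.asarray(lat_deg, dtype=float)
    usable = (lat >= lat_min) & (lat <= lat_max) & np.isfinite(profile)
    if not usable.any():
        raise ValueError("no finite values inside the latitude band")
    # Exclude points outside the band (or missing) from the search.
    masked = np.where(usable, profile, -np.inf)
    return lat[np.argmax(masked)]


# Usage with a synthetic 2-D wind stress field (nlat, nlon) peaking at 50S.
lat = np.arange(-80.0, 81.0, 10.0)
tau = np.exp(-((lat + 50.0) / 10.0) ** 2)[:, None] * np.ones((1, 4))
tau_zonal_mean = np.nanmean(tau, axis=1)
print(latitude_of_maximum(tau_zonal_mean, lat))
```

Applying the same function to the wind stress profile and to the nutrient inventory of each model gives one point per model in a scatter plot of this kind.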

Nonetheless, we must be careful not to over-generalize. First, coarse-resolution ocean models tend to overestimate the response to surface winds (e.g., Farneti et al. 2010; Ito et al. 2010). Second, the three Earth system models studied here use ocean components that differ in terms of horizontal resolution, vertical resolution and parametrizations. A model’s response to surface wind forcing and the resulting Ekman transport is highly dependent on the coefficients used for the Gent and McWilliams (1990) parametrization (Farneti et al. 2010; Sallée et al. 2010). These coefficients vary among the three Earth system models, from 600 m2 s−1 in CNRMCM5 to 2,000 m2 s−1 in IPSLCM4 and IPSLCM5. Thus we cannot currently isolate the effects due purely to the resolution of the atmospheric component. A priority for future research should be to compare different versions of an Earth system model differing only in the resolution of its atmospheric component, as was done by Russell et al. (2006), i.e., with the same ocean component and Gent and McWilliams (1990) parametrization.

4.4 Deep-ocean ventilation

The three models have difficulty properly ventilating deep water masses. Thus they fail to adequately represent some ocean-atmosphere or ocean-sea ice interactions that drive light-to-dense water transformation, especially in the Southern Ocean. Our analysis indicates that most of these misrepresentations stem from biases in the thermodynamic properties of the intermediate (i.e., between 800 and 1,200 m) and deep (i.e., between 1,200 and 2,500 m) water masses (Fig. 12). In particular, all models fail to reproduce the salinity of the intermediate and deep water masses. This points toward a poor representation of freshwater fluxes in the models, although the exchange of freshwater is one of the most important interactions between the ocean, the atmosphere, and the sea ice in high latitudes. Biases in freshwater fluxes stem from a combination of misrepresentations or biases found in the different components of Earth system models. For example, it is likely that the warm biases over the Southern Ocean, diagnosed from sea surface temperature anomalies (Fig. S1), inhibit subsurface waters from ventilating the deep ocean. Yet model differences in high-latitude salinity can also stem from the different sea-ice models, which exhibit large differences in terms of sea-ice dynamics, thermodynamics, and rheology. For instance, sea-ice dynamics contributes to the formation of the intermediate water mass that is characterized by a salinity minimum (Sloyan and Rintoul 2001). For CNRMCM5, it is likely that the stronger-than-observed freshwater flux lowers surface density, hence slowing down deep ventilation in the Southern Ocean.

Fig. 12 Zonal average of the seasonal climatology of temperature (T, in °C), salinity (S), and CFC12 (nmol L−1) over 1995–2005. The observed climatology is based on GLODAP (Key et al. 2004) for CFC12 and the World Ocean Atlas 2009 for temperature (Locarnini et al. 2009) and salinity (Antonov et al. 2009). Anomalies between modeled and observed CFC concentrations are superimposed as thin black lines over the IPSLCM4, IPSLCM5 and CNRMCM5 panels of zonally averaged CFC concentrations

Such differences in water mass properties induce biases in CFC inventories (Figs. 12, S7). As the IPSL models underestimate the light-to-dense formation of intermediate waters, the CFCs are stored in the mode waters (i.e., between 0 and 1,000 m) and the deep waters (i.e., below 2,000 m). In contrast, CFCs are stored in the upper layer of the intermediate waters in CNRMCM5, because the volume of this water mass is overestimated in that model. The mean CFC11/CFC12 age indicates that the mode waters and the upper layer of the Southern Ocean intermediate waters (i.e., between 500 and 1,000 m) are older than observed by 5 years for IPSLCM5 and 12 years for CNRMCM5. This bias is typically found in z-coordinate ocean models, which struggle to produce adequate flow of dense waters along continental slopes (Doney 2004).

To evaluate models in terms of their export of young subsurface waters to the deep ocean, we use a scatter plot of surface density anomalies versus anomalies of O2 inventory between 1,000 and 2,000 m (Fig. 13). This shows (1) how the models’ mean states compare to the observations at the 95 % significance level and (2) how the models compare in their skill in ventilating the deep ocean. The IPSL models broadly capture the distribution of surface density. Yet they slightly underestimate the inventory of O2 contained between 1,000 and 2,000 m. Conversely, CNRMCM5 completely fails to reproduce the distribution of O2 inventory with the right surface density. This result is consistent with Fig. 12, which shows that CNRMCM5’s surface density is too low; the model thus does not ventilate deep waters with an adequate supply of O2.

Fig. 13 Scatter plot of the winter-averaged surface density anomalies (in kg m−3) versus anomalies of annual-mean oxygen inventory between 1,000 and 2,000 m. Dashed black lines represent the 95 % confidence interval for surface density anomalies and oxygen inventory anomalies. Green, red and blue points represent IPSLCM4, IPSLCM5 and CNRMCM5, respectively

Thus improved ocean-sea ice coupling should lead to a better representation of the sea surface salinity in the high latitudes, a major weakness of these models, helping to improve the representation of the winter mixed-layer depths and deep-ocean ventilation, especially in the Southern Ocean.

Biases in deep-ocean ventilation affect the subsurface distribution of tracers (e.g., CFC11 concentrations, Figs. 4, 12). In CNRMCM5, sluggish deep ventilation results in suboxic deep waters (O2 concentrations lower than 60 μmol L−1), particularly in the North Pacific. Suboxic waters stimulate denitrification, a sink for \(\hbox{NO}_{3}^{-}\) at depth. In the PISCES model, the nitrogen that is removed by denitrification is compensated by nitrogen fixation at the surface. Thus nitrogen is conserved, but errors in deep-ocean O2 result in errors in denitrification, and those propagate to the surface ocean through compensating nitrogen fixation.

4.5 Spinup strategy versus skill assessment

Using physical fields to diagnose problems in deep-ocean ventilation is not straightforward. It requires, for example, a complex water mass framework (Iudicone et al. 2008; Downes et al. 2011). In comparison, biogeochemical tracers that are integrated for a thousand years or more, in order to reach near-steady-state conditions, contain strong signatures of the circulation. In that context, our results demonstrate that passive tracers like CFCs, which are integrated for ~50 years, provide a good basis for assessing models. They enrich the mix of observations and hence better constrain ocean dynamics. Thus there are several advantages to using biogeochemical tracers rather than diagnosing the ocean dynamics alone, especially because some biogeochemical data such as surface chlorophyll or primary productivity have been continuously monitored for several decades.

However, it does matter how an ocean biogeochemical model is initialized. When an ocean biogeochemical model is first integrated offline for thousands of years, until it reaches near-steady-state “natural” conditions, the resulting biogeochemical fields are consistent with its ocean circulation fields. In this state, the intrinsic variability of the biogeochemical fields is stronger than their drift. Alternatively, the more economical approach of initializing the ocean biogeochemical model with observed climatologies and integrating for much shorter periods before it is run as part of the Earth system model guarantees that simulated ocean biogeochemical fields are close to those observed, but at the cost of substantial model drift. The “steady-state” spin-up is preferable for analyzing the seasonal cycle, climate variability, and regional sensitivities as well as the model’s true natural state.
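One simple way to check whether variability dominates drift is sketched below. It assumes an annual time series from a control run, measures drift as the linear trend accumulated over the record, and measures variability as the standard deviation of the detrended series; the function name and this particular criterion are assumptions introduced for illustration.

```python
import numpy as np


def drift_and_variability(series, years):
    """Return (drift, variability) for an annual time series.

    drift       : linear trend multiplied by the record length
                  (assumes `years` is sorted in increasing order).
    variability : sample standard deviation of the detrended series.
    Both are expressed in the units of `series`.
    """
    series = np.asarray(series, dtype=float)
    years = np.asarray(years, dtype=float)
    slope, intercept = np.polyfit(years, series, 1)
    residual = series - (slope * years + intercept)
    drift = slope * (years[-1] - years[0])
    variability = residual.std(ddof=1)
    return drift, variability


# Synthetic example: a weak trend superimposed on interannual fluctuations.
years = np.arange(100)
series = 2.0 + 1.0e-4 * years + 0.5 * (-1.0) ** years
drift, variability = drift_and_variability(series, years)
print("near steady state:", abs(drift) < variability)
```

A model initialized from observed climatologies and integrated only briefly would typically show the opposite ordering for deep tracers, with the accumulated drift exceeding the variability.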

However, our results illustrate that the same approach may be problematic if a model poorly represents deep-ocean circulation. Thus not only model skill but also projections depend on the approach selected for initializing the ocean biogeochemical model. Yet in CMIP5, there is no recommended protocol for initializing biogeochemical models. In subsequent international efforts to compare Earth system models, we recommend that model groups discuss this issue to determine if a common approach is possible and to evaluate the consequences if one is not taken. Meanwhile, sensitivity tests using both approaches to initialize the ocean biogeochemical component of one or more Earth system models would be of great interest for the community.

5 Conclusions

We have assessed three Earth system models with a common marine biogeochemistry: IPSL-CM4-LOOP, IPSL-CM5A-LR and CNRM-CM5.1. The first contributed to IPCC AR4 (CMIP3) while the latter two are contributing to IPCC AR5 (CMIP5). Evaluation of physical fields (wind stress, shortwave radiation, net heat flux, SST, SSS, MLD, and water mass transport) and biogeochemical fields (macronutrients, O2, DIC, Alk, \(\hbox{CHL}_{\rm a}, \Updelta p \hbox{CO}_{2}, \) air-sea exchange of CO2, export of organic matter, silicate and calcite production, and primary productivity) demonstrated that models generally differ most in the Southern Ocean. In that region, IPSL-CM5A-LR and CNRM-CM5.1 offer much poorer representations of deep-ocean circulation than does IPSL-CM4-LOOP. Model biases in deep-ocean ventilation affect deep tracer concentrations, which are lower than observed. Nonetheless, improvements in surface winds, mixed-layer depth, and large-scale geostrophic circulation in IPSL-CM5A-LR and CNRM-CM5.1 contribute to more accurate simulated surface-ocean biogeochemical fields (e.g., CHLa, meridional gradients of macronutrients, location of ingassing and outgassing regions). Our results emphasize the need for Earth system models to better represent the summer mixed-layer depth, which plays a fundamental role when simulating marine biogeochemistry, particularly in the high latitudes.

Our study demonstrates that a systematic methodology based on basic skill assessment metrics (e.g., bias, root mean square error) can be valuable in understanding the ocean dynamics and biogeochemistry of Earth system models. Indeed, much could be learned from basic statistical metrics and ocean tracers in a routine skill assessment process, especially because the use of biogeochemical or passive tracers can help to constrain the ocean dynamics through additional data. Concerning the skill assessment of biogeochemical models, the use of derived metrics like Alk-DIC allows a focus on model properties, not their biases.

For the next IPCC assessment, much could be learned if Earth system models included one simple passive tracer, like CFC11, in their ocean component (e.g., run optionally for 50 years as one additional passive tracer in the online ocean component). With this one anthropogenic tracer, it would be straightforward to assess ocean model skill in ventilating subsurface waters (e.g., England 1995; Smethie et al. 2001; Dutay et al. 2002; Matsumoto et al. 2004, 2010), which is crucial for ocean biogeochemistry. Such simulations were performed routinely a decade ago for evaluating the ocean circulation component of ocean carbon models during OCMIP. It would be highly advantageous to institute the same practice for subsequent phases of CMIP and even in the ongoing development of ocean models.

However, skill assessment of ocean dynamics is not the whole story. Earth system models contributing to CMIP5 differ enormously, both in their physical components (atmosphere, ocean, sea-ice) and in their biogeochemical components. Like atmosphere, ocean, land and sea-ice models, biogeochemical models also need to be improved to better represent subtle biogeochemical processes (Doney 1999). Although the PISCES ocean biogeochemical model that we have used here may be considered state of the art, there is still a need for improvements (e.g., biological pump) and greater diversity (e.g., non-Redfield models, phytoplankton type models). In that context, sensitivity experiments and parametrization comparisons (e.g., Gehlen et al. 2006; Kriest et al. 2010) should be conducted in parallel with the ongoing skill assessment of coupled models.