1 Introduction

A hierarchy of models, including Earth system models (ESMs) and simple climate models (SCMs), has been used during the preparation of the Intergovernmental Panel on Climate Change Sixth Assessment Report (IPCC AR6) (Masson-Delmotte et al. 2021; Shukla et al. 2022). SCMs benefit from computational efficiency, while ESMs have a more detailed representation of complex Earth system processes. SCMs allow extending ESM simulations to a larger set of scenarios, but this modeling strategy depends on their fidelity to the more complex models. SCMs can be calibrated against observations, ESMs, or both. Phases 1 and 2 of the Reduced Complexity Model Intercomparison Project (RCMIP) provide an overview of the approaches used to constrain SCMs against observations to reproduce responses of Earth system variables, e.g., temperature and ocean heat uptake (Nicholls et al. 2020, 2021). They show that SCMs constrained to observations perform well in estimating key benchmarks (Nicholls et al. 2021). In particular, this leads to lower estimates of future increases in global surface air temperature (GSAT) compared to the unconstrained ESMs of the sixth phase of the Coupled Model Intercomparison Project (CMIP6) (Eyring et al. 2016). At the same time, applying observationally based warming constraints to the CMIP6 ESMs reduces the ensemble mean of GSAT estimates (Tokarska et al. 2020). Calibrating SCMs against a single ESM or a range of ESMs, i.e., using them as climate model emulators, enables replicating and interpolating the climate responses of complex ESMs in a large set of scenarios (Nicholls et al. 2020; Forster et al. 2021). Yet, SCMs can diverge from ESMs because they lack the model structure needed to represent some particular climate process or climate feedback.

Previous studies, to some extent, examined the consistency between SCMs and ESMs (Joos et al. 2013; Nicholls et al. 2020, 2021; Liddicoat et al. 2021). Among CMIP6 generation model-based studies, RCMIP evaluated the SCMs with a focus on the response of climate variables (Nicholls et al. 2020, 2021). Yet, its comparison to ESMs remained limited, because the carbon cycle was examined only via the transient climate response to cumulative carbon emissions (TCRE). Liddicoat et al. (2021) focused on evaluating CMIP6 ESMs using the concentration-driven Shared Socioeconomic Pathways (SSP) simulations and compared their fossil fuel (FF) emissions compatible with the prescribed CO2 concentration pathway with the FF emissions that are generated by integrated assessment models (IAMs) and harmonized to make those SSP scenarios (Gidden et al. 2019). However, SCMs were out of the scope of their study. Joos et al. (2013) evaluated the responses of Earth system models of different complexities to a CO2 emission pulse. However, the authors focused on idealized emission-driven simulations and provided limited details on the land and ocean carbon cycle processes that induce uncertainties in global warming and temperature change. Limited research has been performed to compare the carbon cycle responses between CMIP6 ESMs and SCMs, with Quilcaille et al. (2023) being an exception.

Concentration-driven simulations, as opposed to their emission-driven counterparts, ensure consistency between background CO2 concentrations across models and thus enable a more consistent analysis of carbon cycle processes under the same CO2 concentrations (Gregory et al. 2009; Arora et al. 2020). In concentration-driven simulations, compatible FF CO2 emissions (EFF) can be calculated as the sum of the prescribed atmospheric CO2 growth rate (GAtm) and the estimated net ocean (SOcean) and land (SLand) carbon sinks (positive into the ocean/land), where the land sink accounts for land-use change (LUC) emissions (Liddicoat et al. 2021):

$${E}_{FF}={G}_{Atm}+{S}_{Land}+{S}_{Ocean}$$
(1)
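As a minimal illustration of Eq. 1, the sketch below computes compatible FF emissions from a prescribed CO2 concentration series and modeled sinks. The function name, the sample values, and the conversion factor of about 2.124 GtC per ppm are assumptions introduced here for illustration and are not taken from this study; sinks are positive into the land/ocean.

```python
GTC_PER_PPM = 2.124  # assumed conversion factor (GtC per ppm CO2), not from this study


def compatible_ff_emissions(co2_ppm, s_land, s_ocean):
    """Return annual compatible FF emissions (GtC/yr) following Eq. 1.

    co2_ppm: prescribed annual CO2 concentrations (ppm), length n + 1
    s_land, s_ocean: net land and ocean sinks (GtC/yr, positive into
        land/ocean), length n; s_land includes LUC emissions (i.e., NBP)
    """
    if not (len(s_land) == len(s_ocean) == len(co2_ppm) - 1):
        raise ValueError("need n + 1 concentrations for n flux values")
    e_ff = []
    for i in range(len(s_land)):
        # Atmospheric growth rate from the year-to-year concentration change
        g_atm = (co2_ppm[i + 1] - co2_ppm[i]) * GTC_PER_PPM
        e_ff.append(g_atm + s_land[i] + s_ocean[i])
    return e_ff


# Hypothetical values for two years
example = compatible_ff_emissions([400.0, 402.0, 404.5], [1.0, 1.5], [2.5, 2.6])
```

With these hypothetical inputs, a larger land or ocean sink under the same concentration pathway directly translates into larger compatible emissions.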

The consistency of compatible emission estimates between SCMs and ESMs is crucial because SCMs are widely used in climate negotiations, e.g., when assessing the adequacy (or lack) of the nationally determined contributions with respect to climate objectives (Tanaka and O’Neill 2018; Shukla et al. 2022). Here, we explore whether prescribed CO2 concentration trajectories in SCMs and ESMs result in consistent carbon cycle fluxes, including but not limited to compatible emissions. We analyze the concentration-driven outputs of temperature and land and ocean carbon cycle fluxes simulated by RCMIP SCMs (Table 1) and CMIP6 ESMs (Table 2) under the eight available SSPs. The SSPs are developed by IAMs that describe the social and economic components that can determine future climate change (Ackerman et al. 2009; van Vuuren et al. 2017) (Table 3, Fig. 1). In CMIP6, the emissions of different greenhouse gases (GHGs) from IAMs are then harmonized for the base year to their respective values in the historical inventories using the Aneris software (Gidden et al. 2019) and converted to concentrations by the SCM MAGICC version 7 (Fig. 1c). The land-use and land cover data from IAMs are converted to a readable gridded format suitable for ESMs within the harmonization of global land-use change and management version 2 (LUH2) project (Hurtt et al. 2017, 2020).

Table 1 List of RCMIP Phases 1 and 2 models analyzed in this study and their characteristics
Table 2 List of CMIP6 ESMs analyzed in this study and their characteristics
Table 3 SSP scenarios and their characteristics
Fig. 1

Global cumulative a FF and b LUC CO2 emissions (GtC, relative to the year 2000) generated by the IAMs that provided the CO2 emission pathways corresponding to each SSP scenario and c atmospheric CO2 mixing ratio calculated by MAGICC7 and used as an input for concentration-driven ESM simulations

In this study, we tackle the following research questions:

1) Are the historical carbon fluxes estimated by ESMs and SCMs consistent with each other and with observations?

2) How do the future carbon fluxes estimated by ESMs and SCMs under SSPs compare against each other?

3) What are the sources of inconsistencies between carbon fluxes estimated by ESMs and SCMs?

This study concludes with a set of recommendations for improving the representation and consistency of the carbon cycle processes in complex and simple models.

2 Methods

2.1 Data

We analyzed the outputs of concentration-driven SCMs (Table 1) and ESMs (Tables 2, S3) simulations under eight SSPs of ScenarioMIP: SSP1-1.9, SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP4-3.4, SSP4-6.0, SSP5-3.4-OS, and SSP5-8.5 (O’Neill et al. 2014, 2016; Riahi et al. 2017) (Table 3) that were available from the RCMIP Phases 1 and 2 archives (Nicholls et al. 2020, 2021). We used the following variables of seven SCMs: surface air temperature change, net ocean-to-atmosphere flux CO2, net land-to-atmosphere flux CO2 (that is “natural” land sink without accounting for LUC emissions), net primary production (NPP), and agriculture, forestry and other land use (AFOLU) emissions. In further calculations, we used the difference between net land-to-atmosphere flux CO2 and AFOLU emissions as SLand and net ocean-to-atmosphere flux CO2 as SOcean, both positive sinks to land/ocean. SCMs were observationally constrained in the historical calibration and probabilistic setups (see Table 1, description papers, and Nicholls et al. (2021)).

We used the following variables of nine ESMs (one ensemble member for each ESM): surface air temperature (tas), net biome production, NBP (nbp), gas exchange flux of CO2 (fgco2), net carbon mass flux into atmosphere due to land-use change (fLuc), gross primary production (gpp), autotrophic respiration (ra), and heterotrophic respiration (rh). In further calculations, we used nbp and fgco2 as SLand and SOcean. The fLuc variable of ESMs provides incomplete quantification of LUC emissions that excludes forest regrowth and legacy soil carbon decay or gains (Ciais et al. 2022). Among SCMs, only OSCAR of RCMIP Phase 1 internally calculated LUC emissions.

For ESMs, we used anomalies relative to the long-term mean piControl values, removing any residual trend in the piControl experiment. For SCMs, we estimated the NBP as the difference between their net land-to-atmosphere flux CO2 and LUC emissions from IAMs provided in the SSP database (except for OSCAR, which provided both land carbon sink and LUC emission outputs). When ESMs provided NBP and fLuc estimates, we calculated the “natural land carbon sink” (excluding LUC emissions) by adding fLuc (positive to atmosphere) to NBP. We used the CO2 concentration by Meinshausen et al. (2020) to estimate GAtm for Eq. 1.
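The two ESM post-processing steps described above can be sketched as follows. This is only an illustration under stated assumptions: the residual piControl trend is assumed to be linear and the control run is assumed to be time-aligned with the scenario run; the function names are hypothetical and the study's actual procedure may differ.

```python
def linear_fit(y):
    """Least-squares slope and intercept of y against its index 0..n-1."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two points")
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def anomaly_vs_picontrol(run, picontrol):
    """Subtract the linearly fitted piControl value (mean plus drift) from a run."""
    slope, intercept = linear_fit(picontrol)
    return [v - (intercept + slope * i) for i, v in enumerate(run)]


def natural_land_sink(nbp, fluc):
    """Natural land sink = NBP + fLuc, with fLuc positive to the atmosphere."""
    return [n + f for n, f in zip(nbp, fluc)]


# Hypothetical series: a drifting control run and a run offset from it by 1.0
anomalies = anomaly_vs_picontrol([1.0, 1.1, 1.2, 1.3], [0.0, 0.1, 0.2, 0.3])
natural = natural_land_sink([1.0, 2.0], [0.5, 0.5])
```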

To evaluate GSAT estimates of the models, we used the mean of the following observationally based datasets: Cowtan and Way v2 (Cowtan and Way 2014), GISTEMP v4 (NASA Goddard and Institute for Space Studies 2020), CRU TS 4.00 (Harris et al. 2014), University of Delaware 4.01 (Lawrimore et al. 2011), and the reanalysis-observation hybrids CRU-NCEP and Princeton University (Sheffield et al. 2006). To evaluate historical carbon fluxes, LUC, and FF emissions with associated uncertainties, we used the Global Carbon Budget 2021 (GCB2021) dataset (Friedlingstein et al. 2021) and historical data from Gruber et al. (2019), Khatiwala et al. (2009), and Li et al. (2016). GCB2021 is a synthesized dataset of major global carbon budget components, including FF emissions based on energy statistics and cement production data, LUC emissions based on land use and land-use change data and bookkeeping models, ocean carbon uptake based on estimates of global ocean models and observationally based data products, and land carbon sink based on dynamic global vegetation models (Friedlingstein et al. 2021). Historical data from Gruber et al. (2019) include ocean carbon flux based on observations from the US Global Ocean Carbon and Repeat Hydrography Program. Historical data from Khatiwala et al. (2009) include a reconstruction of anthropogenic ocean carbon flux using tracer observations. Flux estimates of Gruber et al. (2019) and Khatiwala et al. (2009) include FF emissions from Boden et al. (2009) and LUC emissions based on bookkeeping models. For the two datasets, Khatiwala et al. (2009) and Boden et al. (2009), we calculate the land carbon flux using Equation 1. Finally, Li et al. (2016) provide estimates of land and ocean carbon uptakes, as well as LUC and FF emissions. These estimates are constrained with the Bayesian fusion approach and averaged over 5-year periods between 1980 and 2014. The constraining process behind the Li et al. (2016) estimates involves all the datasets described above (albeit with an earlier version of GCB). We used the posterior values and uncertainties from Table S1 of Li et al. (2016) to estimate cumulative values of two 15-year periods (1980–1994 and 2000–2014) and evaluate models (Fig. S1).

2.2 Evaluation of the models and model configurations

We selected ESMs and SCMs that provided carbon flux estimates for at least one SSP at the time of the analysis. For three ESMs, CanESM5, IPSL-CM6A-LR, and MIROC-ES2L, we estimate internal uncertainty, i.e., the range of ensemble runs of individual ESMs, based on the 5–50 ensemble members (depending on availability) that differ in their initial conditions and, for CanESM5 only, physical processes. Further, because some SCMs were available in multiple versions and calibrations, we selected versions/configurations based on an evaluation of their performance during the historical period and future SSP scenarios to ensure an equal weight to each model in estimating the model-ensemble mean. To measure the similarity between time series produced by models, we used the figure of merit in time (FMT), also known as Ruzicka similarity, which is defined as

$$FMT=\frac{\sum_{t=1}^{tmax}\min \left({X}_t,{Y}_t\right)}{\sum_{t=1}^{tmax}\max \left({X}_t,{Y}_t\right)},$$
(2)

where Xt and Yt are the values of the two time series at time step t. FMT ranges from 0 to 1, with lower values indicating less similarity. A phase shift in time series causes a decrease in FMT, and ESM simulation outputs with large interannual variability are expected to have lower FMTs than SCMs (Figs. S2–S8). In addition, lower FMTs during the historical period than in future scenarios may indicate the dominance of long-term trends over interannual variability in the future scenarios.
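The FMT definition above can be sketched in a few lines. The function name is hypothetical, and the 0 to 1 range holds only for non-negative series, which is an assumption of this sketch.

```python
def fmt_similarity(x, y):
    """Figure of merit in time (Ruzicka similarity) of two equal-length series.

    Assumes non-negative values, for which the result lies between 0 and 1.
    """
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    if denominator == 0:
        raise ValueError("FMT is undefined when all maxima are zero")
    return numerator / denominator


# Hypothetical series: identical, partly overlapping, and phase-shifted
fmt_identical = fmt_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
fmt_partial = fmt_similarity([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
fmt_shifted = fmt_similarity([0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0])
```

The phase-shifted pair shares no overlap at any time step, which illustrates why interannual variability that is out of phase between models lowers FMT.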

When there were multiple versions or calibrations of a single model (e.g., Hector, MCE, WASP), we selected one of them. Specifically, we used the Hector DEFAULT calibration (and not Histcalib). Although the two calibrations led to similar results compared to other models, the DEFAULT calibration provided data for more scenarios. We used the newer version of MCE (v.1-2 from RCMIP Phase 2) rather than MCE v.1-1 (from RCMIP Phase 1). Finally, we used the historical calibration of WASP so that all considered SCMs are in historical calibration or probabilistic setup. Although OSCAR provided outputs for both phases of RCMIP, we used the Phase 1 outputs because they had explicit LUC emission estimates (including loss of additional sink capacity) obtained using a bookkeeping model (Gasser et al. 2020). ACC2 did not provide carbon cycle outputs for the concentration-driven SSPs in RCMIP. Thus, we additionally performed CO2 concentration-driven simulations for ACC2.

RCMIP Phase 2 focused on the evaluation of the probabilistic climate based on SCMs. However, it did not discuss the carbon cycle in detail. Although we mainly used the central (50th percentile value) estimates of SCMs for this study when simulations were performed in a probabilistic setup, here we also discuss the assessment range of SCMs driven in a probabilistic setup (Table 1). We compare the SCM model spread and the probabilistic SCM distribution against ESM estimates. We also compare inter-model SCM uncertainty and their probability ranges to the internal uncertainties of selected ESMs (ranges of ensemble runs of individual ESMs). Although we do not use all SCM versions/calibrations for computing ensemble means, we show the separate SCM estimates in Figs. S11–S18.

3 Results

We evaluated carbon fluxes from ESMs and SCMs in historical and future periods. In the historical period, the differences between estimates by complex and simple models could be attributed to the observational constraining that is applied in the majority of SCMs (Table 1) but not in ESMs. In the future period, the differences could be partly explained by a few model outliers (both ESMs and SCMs). In addition, the differences in future scenarios arose from differences in the way LUC emissions were calculated and in the assumptions on carbon-concentration and carbon-climate feedbacks. In this section, we discuss these sources of differences in projections between complex and simple models.

3.1 Performance of models in the historical period

The larger warming estimated by CMIP6 ESMs than by SCMs over the historical period was discussed by Liddicoat et al. (2021) and Nicholls et al. (2021). It can be attributed to (i) a higher climate sensitivity of CMIP6 ESMs (on average) and (ii) the fact that SCMs are constrained to historical observations (Fig. 2a). The estimates of total (FF and LUC) compatible CO2 emissions by CMIP6 ESMs mostly fall within the uncertainty range of GCB2021. The corresponding estimates by SCMs are lower and outside the uncertainty range of the GCB2021 historical CO2 emissions (Fig. 2c). CMIP6 ESMs tend to estimate higher compatible FF emissions than SCMs and higher cumulative compatible FF emissions than observationally based FF emissions over the last three decades (Fig. 2f, h–m). The ESM estimates of total compatible emissions are consistent with GCB2021, while their estimates of compatible FF emissions are higher than GCB2021, which implies that ESMs underestimate LUC emissions, as discussed by Melnikova et al. (2022) (Fig. 2g).

Fig. 2

a Simulated GSAT change (in °C, relative to 1850–1899) evaluated against the mean of six observational datasets for the 1965–2015 period. Same as a for b NBP (land sink with LUC emissions), c compatible FF and LUC CO2 emissions, d ocean carbon sink, e “natural” land sink without LUC emissions, f compatible FF CO2 emissions, and g LUC emissions, evaluated against GCB2021 (black) and GCB2021 data-driven ocean sink with residual land sink (green) in GtC year−1. Compatible FF CO2 emissions of models are compared to GCB2021 FF CO2 emissions. The 5-year moving averages of model-ensemble means are shown. Gray shading indicates uncertainty from the SD of observational datasets for GSAT and the uncertainty provided by GCB2021. The percentage of ESMs and SCMs that have consistent (within the uncertainty range provided with the data), higher, or lower estimates of decadal and cumulative global NBP, natural land sink, ocean carbon flux, and compatible FF CO2 emissions compared with historical observationally based datasets: h over 1980–2011 and i 1850–2011 periods by Khatiwala et al. (2009), j over the 1980–2014 period by Li et al. (2016), k over the 1994–2007 period by Gruber et al. (2019), and l over 1990–2020 and m 1960–2020 periods by GCB2021. ESMs and SCMs are shown in bright and pastel colors, respectively. Note that figures show the inter-model spreads of median projections from each model, without accounting for uncertainties within each model's projections

In the concentration-driven model simulations, the estimates of compatible FF emissions are directly linked to the land and ocean carbon uptakes. Both ESMs and SCMs underestimate decadal ocean carbon uptake for the current period relative to estimates by Li et al. (2016), who reduced the carbon flux uncertainty using a Bayesian fusion approach. But the cumulative ocean carbon uptake estimates by both ESMs and SCMs are within the uncertainty range of historical estimates by Gruber et al. (2019) and GCB2021. ESMs, with a few exceptions, estimate slightly higher NBP than historical observationally-based estimates and SCMs over the historical period (Fig. 2b, h–m). The higher estimates of cumulative land carbon uptake by ESMs are not fully compensated by the lower estimates of cumulative ocean carbon uptake (relative to historical observationally based datasets), which leads to higher compatible FF emissions (Fig. 2h–l). In contrast, the lower estimates of cumulative land carbon uptake (NBP, Natural Land Sink) by ~20% of SCMs lead to lower compatible FF emissions estimates.

3.2 Emergent constraints

We evaluated the response of ESMs and SCMs to future forcing using emergent constraints (ECs). EC approaches have been broadly used in the CMIP5 and CMIP6 communities for future warming (Tokarska et al. 2020; Schlund et al. 2020) and the carbon cycle (Cox et al. 2013; Varney et al. 2020; Wenzel et al. 2014). While existing studies offer ECs on specific aspects of the carbon cycle, such as tropical carbon sensitivity to warming (Cox et al. 2013; Wenzel et al. 2014) or soil carbon turnover (Varney et al. 2020), we attempted to develop a statistical relationship between global carbon fluxes. To this end, we plotted the estimates of 2015–2049 and 2065–2099 cumulative carbon fluxes against the estimates of 1980–2014 cumulative fluxes over the historical period (Figs. S9 and S10). The ESMs that estimate higher land and ocean uptakes during the historical period also give higher future carbon uptake in high CO2 concentration scenarios. The correlations weaken (in terms of statistical significance) with time so that they are more reliable in the earlier future period and are not sustained in the low CO2 concentration and overshoot pathways. This might be related to the more complex nature of mitigation scenarios that include ramp-up and ramp-down phases of CO2 concentration and GSAT, as well as assumptions on implementing the land-based CO2 removal technologies in the climate mitigation scenarios that influence the land carbon sink. A few model outliers also weaken the ECs. For example, compared to other models, WASP-v2 and CanESM5 simulate much larger increases in ocean and land carbon uptake, respectively, under high-concentration scenarios. The time series of the future carbon fluxes for each model confirmed these deviations (Figs. S13–S18). The EC is weaker (less reliable) for SCMs due to the larger range of carbon cycle feedbacks to the changes in CO2 and GSAT under SSPs, as well as the historical constraining of SCMs’ carbon cycle feedbacks (see Section 3.5.3).
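The kind of cross-model relationship examined here can be sketched as below, assuming an ordinary least-squares fit and a Pearson correlation across models. The statistical procedure actually used is not specified in this section, and the function name and all numbers are hypothetical.

```python
import math


def pearson_r_and_slope(historical, future):
    """Pearson correlation and least-squares slope of future vs. historical values.

    Each list holds one cumulative flux value per model (same model order).
    """
    n = len(historical)
    if n != len(future) or n < 3:
        raise ValueError("need at least three models with paired values")
    mean_x = sum(historical) / n
    mean_y = sum(future) / n
    sxx = sum((x - mean_x) ** 2 for x in historical)
    syy = sum((y - mean_y) ** 2 for y in future)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(historical, future))
    if sxx == 0 or syy == 0:
        raise ValueError("values must vary across models")
    return sxy / math.sqrt(sxx * syy), sxy / sxx


# Hypothetical cumulative uptakes (GtC) for four models
hist_uptake = [100.0, 120.0, 140.0, 160.0]
future_uptake = [300.0, 340.0, 380.0, 420.0]
r, slope = pearson_r_and_slope(hist_uptake, future_uptake)
```

A single model with a very different future response would pull r toward zero in such a small ensemble, which is consistent with the weakening by outliers noted above.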

3.3 Performance of models in the future scenarios

The discrepancies in the future GSAT change estimates between CMIP6 ESMs and SCMs are consistent with those during the historical period (Fig. 3a). While the GSAT increase is always higher in ESMs than in SCMs, the compatible CO2 emissions are nearly consistent between ESMs and SCMs (Figs. 3, S11–S18). Compared to SCMs, the ESMs estimate slightly lower inter-model ensemble mean cumulative emissions in high-concentration SSPs and slightly higher emissions in low-concentration SSPs. The larger GSAT increase estimated by ESMs and the consistent cumulative compatible emissions between ESMs and SCMs indicate that ESMs have higher TCRE than SCMs in the historical period and future SSP scenarios if we do not consider the contributions from non-CO2 forcing (Fig. S19).
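The reasoning can be illustrated with a small calculation: TCRE is the ratio of warming to cumulative CO2 emissions, so for similar cumulative compatible emissions a larger GSAT increase implies a larger TCRE. The function name and the numbers below are hypothetical, and contributions from non-CO2 forcing are ignored, as in the text.

```python
def tcre(gsat_change_k, cumulative_emissions_gtc):
    """Return TCRE in K per 1000 GtC (warming divided by cumulative CO2 emissions)."""
    if cumulative_emissions_gtc <= 0:
        raise ValueError("cumulative emissions must be positive")
    return 1000.0 * gsat_change_k / cumulative_emissions_gtc


# Hypothetical ensemble means with the same cumulative compatible emissions
tcre_esm = tcre(2.0, 1000.0)
tcre_scm = tcre(1.6, 1000.0)
```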

Fig. 3

a GSAT change (in °C, relative to 1850–1899), cumulative compatible b total, c FF and d LUC CO2 emissions, e ocean carbon flux, f NBP (land sink with LUC emissions), and g natural land sink without LUC emissions estimated by ESMs (solid lines) and SCMs (dashed lines) under SSP scenarios over the 2000–2100 period (in GtC). FF and LUC CO2 emissions generated by IAMs (dotted lines) corresponding to each SSP scenario are provided for reference. Shaded areas indicate the ESMs and SCMs inter-model spread for each scenario as one standard deviation (SD). Note that because LUC emissions are estimated by only one SCM, OSCAR, no spread is given. 5-year moving averages are shown. Note that figures show the inter-model spreads of median projections from each model, without accounting for uncertainties within each model projections

LUC emission data are largely inconsistent between ESMs and IAMs (used in SCMs) partly because of the discrepancies that emerge during the translation of the data from IAMs to ESMs (Melnikova et al. 2022). ESMs estimate lower positive LUC emissions (net source to the atmosphere), possibly because they do not include forestry and managed land practice. They also estimate smaller negative LUC emissions (net sink to the land) relative to emissions created by IAMs. The discrepancies may arise because LUC emissions reported by ESMs include deforestation (biomass loss during deforestation), wood harvest, and the carbon release by harvested wood products, but do not account for forest regrowth and legacy soil carbon decay or gains (Melnikova et al. 2022). Among SCMs, only one model, OSCAR, estimates LUC emissions via a bookkeeping approach.

Both ESMs and SCMs estimate higher future compatible FF emissions compared to those simulated by IAMs (Fig. 3c). This issue has been previously discussed by Liddicoat et al. (2021) and may be related to lower estimates of land and ocean carbon uptakes by MAGICC7.0, which was used with IAMs to generate the CMIP6 ScenarioMIP input CO2 concentrations. MAGICC7.0 (a slightly different version from MAGICCv7.5.1 of RCMIP Phase 2) was calibrated to the CMIP5 ESM carbon cycle to include permafrost CO2 and methane feedbacks (Meinshausen et al. 2020; Nicholls et al. 2021). MAGICCv7.5.1 estimates a lower land carbon uptake than observationally based datasets during the historical period and than the model ensemble means in future scenarios (Figs. 1, S15–S16). The lower land carbon uptake estimates by MAGICCv7.5.1 are partly compensated by higher ocean carbon uptake estimates, especially in future scenarios. Because this compensation is only partial, the total future land + ocean carbon uptake simulated by MAGICCv7.5.1 is lower than that of other models. Such deviation of the carbon cycle behavior of MAGICC from other models has broader implications because MAGICC is widely used for future projections informing policies and for translating the IAM emissions to concentrations used by ESMs.

Despite the general agreement of the estimates of cumulative compatible FF emissions between ESMs and SCMs, their estimates of land and ocean carbon uptakes deviate. SCMs estimate higher ocean carbon uptake than ESMs in the historical period and future SSPs. The inter-model spread of cumulative ocean carbon flux in future scenarios is also larger in SCMs (Fig. S9). The estimates of cumulative NBP are larger in ESMs than SCMs over the historical period and all future scenarios, except for SSP4-3.4, which assumes low FF but high LUC emissions relative to other SSPs (Fig. 1, Table 3).

3.4 Probabilistic distributions of SCM estimates

The probabilistic ranges of single SCMs are larger than the SCMs inter-model spreads for all scenarios with a few exceptions (Fig. 4). The MCE of RCMIP Phase 1 provides the largest probabilistic range for GSAT estimates and MAGICC for the carbon cycle. The SCM probability ranges are comparable to the CMIP6 ESM inter-model spreads for all considered variables (apart from WASP’s future ocean carbon flux estimates discussed in Section 3.5.6). Furthermore, the ranges for ocean carbon flux largely exceed the inter-model spread of CMIP6 estimates. Thus, despite the ESMs and SCMs differences in the median estimates of carbon fluxes, SCMs may be a useful tool for future probabilistic estimates of carbon cycle.

Fig. 4

a–d GSAT change (in °C, relative to 1850–1899), cumulative e–h natural land sink without LUC (GtC), and i–l ocean carbon flux (GtC) by ESMs and SCMs (Phases 1 and 2) under selected SSP scenarios over the 1990–2014 historical period and 2081–2100 (GSAT) and 2015–2100 (fluxes). The box plots of SCMs are the assessed probabilities (50th percentile, central box line; 33rd and 67th percentiles, lower and higher box limits; and 17th and 83rd percentiles, whiskers); the box plots of three selected ESMs are shown by the 6-member ensemble mean values in the central box line (33rd and 67th percentiles, lower and higher box limits; and 17th and 83rd percentiles, whiskers). The ESM ensemble members differ in their initial conditions. The right side of each panel displays the inter-model means and percentiles for SCMs of RCMIP Phases 1 and 2 and CMIP6 ESMs

To increase the number of models, the study incorporated outputs of SCMs in different configurations, e.g., with warming calibrated to historical observations or with an equilibrium climate sensitivity (ECS) of 3 K. Additionally, models from both phases of RCMIP are combined. These differences may have affected future projections. For example, when comparing WASP-v2 calibrated with historical observations to WASP-v2 calibrated with an ECS of 3 K, the former gives a lower estimate of future GSAT increase. While parameters other than ECS, such as those affecting CO2 fertilization and temperature feedback to the carbon cycle, are calibrated with historical observations (Table 1), the GSAT differences further affect the estimates of land and ocean carbon fluxes. Consequently, WASP-v2 gives a higher median estimate of carbon sink under future SSP scenarios when the model parameters are calibrated with historical observations.

3.5 The sources of discrepancies in the carbon cycle between ESMs and SCMs

The discrepancies between ESM- and SCM-ensemble means may originate from differences in the representation of specific climate or carbon cycle processes and from a single or a few model outliers that affect ensemble means. In addition, SCMs use observational constraints to reproduce the historical changes in climate and carbon cycle, while such constraints are not applied to ESMs. Here, we discuss how these effects could lead to the discrepancies in the carbon cycle between ESMs and SCMs.

3.5.1 Model outliers

We defined outlier models simply as the ESMs/SCMs that estimate maximum/minimum global land and ocean carbon uptake (maximum/minimum, or upper/lower ends) over 1850–2014 historical and 2015–2100 future SSP scenarios (Tables S1 and S2). The model outliers vary depending on the target carbon flux and SSP scenario, e.g., low vs. high FF or LUC emission pathways, increasing CO2 concentration, and temperature vs. mitigation pathways. Besides, the models that provide data vary with the target scenario.

In the case of land carbon flux, CanESM5 and CNRM-ESM2-1 estimate the highest NBP during historical and future periods among ESMs. These two ESMs do not include a nitrogen cycle explicitly, a process that is shown to limit the land carbon uptake through nitrogen limitations of plant growth (Arora et al. 2020). OSCARv3.1 and ACC2 estimate the highest NBP among SCMs. The land carbon cycle of the version of ACC2 adopted in this study has limited sensitivity to GSAT increase.

In the case of ocean carbon flux, ESMs are nearly consistent with each other (Figs. S6 and S17). However, CanESM5 estimates slightly lower ocean carbon uptake than other ESMs in all future SSPs. This has been shown in an existing study (Arora et al. 2020), but the reasons remain unclear. Among SCMs, WASP-v2 and MAGICCv7.5.1 (when WASP-v2 scenario outputs are not available for the scenario) give maximum and minimum ocean carbon fluxes, respectively, in most future scenarios. MCE-v1-2, which shows the lowest future ocean uptake among SCMs under mitigation scenarios, provides an estimate of ocean carbon uptake closest to ESMs. Furthermore, the response of ocean carbon fluxes to declining CO2 and temperature has larger hysteresis in some SCMs, such as SCM4OPTv2.1 (Fig. S17).

Removing the ESMs and SCMs that produce the maximum/minimum cumulative fluxes over a target scenario improves the agreement of the carbon flux estimates between concentration-driven ESMs and SCMs and reduces their ensemble spreads, especially for NBP, but not for all scenarios (Fig. S20). Removing model outliers is less effective at improving the agreement between models in the scenarios that were run by few (<5) models. In addition, there is a systematic difference in the cumulative ocean uptake between ESMs and SCMs, so the presence of outliers cannot fully explain the discrepancy in the ocean carbon cycle between ESMs and SCMs. After removing outliers, the discrepancy in the future ocean carbon uptake between ESMs and SCMs, albeit of a smaller magnitude, persists in all SSPs (Fig. S20).
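The outlier definition and removal step described in this section can be sketched as follows. The function names, model labels, and values are hypothetical and serve only to illustrate the maximum/minimum selection.

```python
def find_outliers(cumulative_flux):
    """Return the (minimum, maximum) models for one flux and one scenario.

    cumulative_flux: dict mapping model name to cumulative flux (GtC)
    """
    if len(cumulative_flux) < 3:
        raise ValueError("need at least three models to remove two outliers")
    lowest = min(cumulative_flux, key=cumulative_flux.get)
    highest = max(cumulative_flux, key=cumulative_flux.get)
    return lowest, highest


def ensemble_mean_without_outliers(cumulative_flux):
    """Ensemble mean after dropping the minimum and maximum models."""
    lowest, highest = find_outliers(cumulative_flux)
    kept = [v for m, v in cumulative_flux.items() if m not in (lowest, highest)]
    return sum(kept) / len(kept)


# Hypothetical cumulative land uptake (GtC) under one scenario
uptake = {"model_a": 100.0, "model_b": 150.0, "model_c": 160.0, "model_d": 300.0}
outliers = find_outliers(uptake)
trimmed_mean = ensemble_mean_without_outliers(uptake)
```

With only a handful of models, dropping two of them leaves very few members, which is one reason the approach helps less in scenarios run by fewer than five models.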

3.5.2 LUC emissions

The differences in the land carbon fluxes by ESMs and SCMs arise from several sources. They are partly explained by the LUC emissions. The CMIP6 ESMs provide the simulation outputs of NBP, i.e., land carbon uptake accounting for incompletely quantified LUC emissions via the “fLuc” ESM variable, thus underestimating LUC emissions (Melnikova et al. 2022) (Figs. 3 and S14). Additional simulations for each SSP scenario with a fixed land cover (like the “hist-noLu” simulation of Land-Use MIP (LUMIP)) are required to separate the “natural” land sink from the gross LUC emissions, including the foregone sink of land exposed to LUC. On the other hand, among SCMs, only OSCAR has interactions between LUC emissions and land carbon cycle. All other considered SCMs do not estimate LUC emissions but directly use the prescribed values that come from several methods and models (Le Quéré et al. 2016; Gütschow et al. 2016) in the historical period and from IAMs in future scenarios. This may lead to underestimating the impact of LUC on the land carbon uptake, e.g., by ignoring the reduced carbon turnover time in LUC-impacted ecosystems (Erb et al. 2016; Melnikova et al. 2022). Furthermore, during the historical period, the global net LUC emissions are consistently positive, i.e., directed to the atmosphere. However, in future scenarios, the LUC emissions by design may be either negative or positive. For instance, in low-concentration scenarios like SSP1-1.9 and SSP1-2.6 that assume large-scale afforestation for climate change mitigation, net LUC emissions become negative, indicating a land carbon sink (Figs. 1b and S14). Consequently, discrepancies in future net LUC emission estimates between ESMs and SCMs (via IAMs) may be both positive and negative.

3.5.3 Calibration of SCMs with observational constraints

SCMs can undergo a calibration process using historical observations (Table 1), during which the model’s parameters are adjusted to accurately replicate our best-estimate observations, typically represented by median values and probabilistic distributions. The effectiveness of the calibration becomes evident as the resulting probabilistic distribution of model outputs successfully encompasses the historical range of observations (compare Fig. 2 and Fig. S21). While constraining SCMs allows for good agreement with existing observations, it is crucial to note that the range of model calibration may become too narrow when the models are applied to high warming scenarios. This narrow calibration range can lead to the misrepresentation of certain parameters within the models, thereby introducing bias into future projections. We discuss relevant examples, e.g., nitrogen limitation under high CO2 and carbon-climate feedback, in the following section.

3.5.4 Carbon-concentration and carbon-climate feedbacks under high CO2 concentration scenarios

Besides the discrepancy in the LUC component, which is partly due to the incomplete reporting of LUC fluxes by ESMs, ESMs and SCMs differ in how future cumulative land and ocean carbon fluxes respond to changes in CO2 and GSAT. The carbon flux responses differ between high-concentration (Fig. 5) and low-concentration and overshoot (Fig. 6) SSP scenarios.

Fig. 5

Changes in a, e GSAT, global cumulative b, f, i compatible FF CO2 emissions, c, g, j ocean carbon flux, and d, h, k NBP estimated by ESMs and SCMs under SSP scenarios over the 2000–2100 period plotted against changes in a–d CO2 concentration (ppm), e–h CO2 growth rate (ppm yr−1), and i–k GSAT change (°C, relative to 1850–1899) under high-concentration scenarios. Ensemble means of carbon fluxes are calculated based on a suite of models, excluding outliers. The 5-year moving averages of model-ensemble means are shown. Shaded areas indicate the ESMs and SCMs inter-model spread for each scenario as one SD, when multiple models are available. Note that the figures show the inter-model spreads of median projections from each model, without accounting for uncertainties within each model's projections

Fig. 6

Changes in a, e GSAT, global cumulative b, f, i compatible FF CO2 emissions, c, g, j ocean carbon flux, and d, h, k NBP estimated by ESMs and SCMs under SSP scenarios over the 2000–2100 period plotted against changes in a–d CO2 concentration (ppm), e–h CO2 growth rate (ppm yr−1), and i–k GSAT change (°C, relative to 1850–1899) under low-concentration and overshoot scenarios. Ensemble means of carbon fluxes are calculated based on a suite of models, excluding outliers. The 5-year moving averages of model-ensemble means are shown. Shaded areas indicate the ESMs and SCMs inter-model spread for each scenario as one SD. Note that the figures do not show the internal model uncertainties and probability distributions

Under high-concentration scenarios (Fig. 5), SCMs estimate larger increases in land and ocean carbon uptake per unit change in CO2 concentration and GSAT. ESMs estimate higher NBP per unit CO2 increase up to a CO2 concentration of 700–800 ppm. However, unlike SCMs, ESMs show a saturation of NBP with increasing CO2 concentration and GSAT. Possible explanations for this discrepancy include model differences in the land carbon-concentration and carbon-climate feedbacks.

First, the spread of NBP estimates among ESMs is large, reaching ca. 20 GtC year−1 in 2100 under the high-concentration SSP5-8.5 scenario (Fig. S15h). In addition to LUC emissions, this spread might be driven by differences in the carbon-concentration (β) feedback, i.e., the impact of CO2 concentration changes on carbon storage. The differences in the β feedback may be related to the inclusion or absence of the nitrogen cycle in the models (Friend et al. 2014). Most ESMs analyzed in this study include the nitrogen cycle (Table 2). Introducing the nitrogen cycle to an ESM generally weakens the β feedback on land carbon fluxes at high CO2 concentrations (Figs. 5d, S22) because ecosystem nitrogen contents cannot keep up with the increased photosynthetic production, thus limiting carbon assimilation (Arora et al. 2020). Among ESMs, CanESM5 and CNRM-ESM2-1, which do not have an explicit nitrogen cycle module, have the highest CO2 concentration-driven carbon flux estimates at high CO2 concentration levels (Fig. S22).

Second, SCMs may estimate higher land carbon uptake per unit GSAT change because their carbon-climate (γ) feedback on the carbon cycle is too weak, a possible consequence of historical constraining. Although historical land carbon uptake has been largely influenced by the β feedback (Tharammal et al. 2019), its influence might decrease with future warmer temperatures and other limitations (Figs. 5 and S22). While SCMs have been historically constrained against observations (as indicated in Table 1 and Section 3.5.3), incorporating the γ feedback into these models is more challenging since there is relatively limited observational evidence of its global-scale effects.

The inter-model spread in the NBP estimates by ESMs can itself be largely explained by the uncertainty in the γ feedback in the tropics (Fig. S23). Furthermore, unlike the β feedback, which represents carbon flux responses to changes in CO2 concentration, the γ feedback responds to temperature changes, considering not only CO2 but also other non-CO2 GHGs and biogeophysical effects of land cover (Melnikova et al. 2023). Future SSP scenarios encompass diverse changes in non-CO2 GHG concentrations that may further affect the response of the γ feedback. This adds to uncertainty in the estimates of future land carbon fluxes by ESMs.
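For orientation, the two feedback parameters discussed in this section are commonly related to carbon storage changes through a linearized decomposition (e.g., as applied to CMIP6 ESMs by Arora et al. 2020). The notation below is introduced here for illustration and is not taken from this study: ΔC_X is the change in carbon storage of the land (X = L) or ocean (X = O), ΔCO2 is the change in atmospheric CO2 concentration, and ΔT is the change in GSAT. Nonlinear interactions between the two terms are neglected in this sketch.

```latex
\Delta C_X \approx \beta_X \, \Delta \mathrm{CO_2} + \gamma_X \, \Delta T,
\qquad X \in \{\mathrm{L}, \mathrm{O}\}
```

In this form, β_X has units of GtC ppm−1 and γ_X has units of GtC °C−1. A γ_L that is too weak (not negative enough) leaves more of the β-driven uptake uncompensated at high warming levels, which is one way to read the higher land carbon uptake per unit GSAT change in SCMs.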

3.5.5 Carbon flux estimates under low CO2 concentration and overshoot scenarios

Under low-concentration (mitigation and overshoot) pathways, there is a reasonable agreement between SCMs and ESMs regarding the response of compatible FF estimates to changes in CO2 (Fig. 6). However, this agreement arises for the wrong reason, as the two model types differ in opposite directions in their ocean and land components. SCMs estimate a larger hysteresis in the response of ocean carbon uptake to CO2 and GSAT changes, while ESMs estimate a larger hysteresis in the response of NBP to CO2 and GSAT changes.

The discrepancy in NBP is already evident in the ramp-up phase of peak and decline scenarios, due to the reasons outlined in the previous sections. The discrepancy amplifies to the extent that SCMs and ESMs exhibit opposite directions of hysteresis under the SSP4-3.4 scenario (Fig. 6d, h, k). This SSP scenario is characterized by intricate and dynamic land-cover changes (Fig. 1b, Table 3). It is worth noting that LUC emissions play an even more significant role in low-concentration scenarios due to their larger contribution to the total emissions. Moreover, many of these scenarios rely on various land-based mitigation strategies. Unfortunately, the framework of this study does not permit an in-depth exploration of the underlying reasons for the discrepancies in land carbon uptake estimated by ESMs and SCMs. Hence, we recommend that future studies thoroughly compare the carbon cycles of ESMs and SCMs under overshoot and mitigation scenarios.

The estimates of ocean carbon uptake by SCMs and ESMs exhibit reasonable agreement during the ramp-up phase of overshoot scenarios. However, in the ramp-down phases of peak and decline scenarios, SCMs consistently display a larger hysteresis in the response of ocean carbon uptake to changes in CO2 and GSAT compared to ESMs. This discrepancy may be attributed to faster carbon mixing in the ocean in SCMs, as discussed in the following section.
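One simple way to put a number on the hysteresis discussed above is to compare cumulative uptake at matching CO2 levels on the ramp-up and ramp-down branches of a peak-and-decline pathway. The sketch below is an illustrative definition only, not the metric used in this study. The function names and the example values are hypothetical, and the ramp-up branch is assumed to have strictly increasing CO2.

```python
def interp(x, xs, ys):
    """Linearly interpolate y at x; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            w = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + w * (ys[i] - ys[i - 1])

def hysteresis_gap(co2, uptake):
    """Mean ramp-down minus ramp-up cumulative uptake at equal CO2 levels.

    co2 and uptake are equally long time series (ppm and GtC). The series
    is split at the CO2 peak, and each ramp-down value is compared with the
    ramp-up value interpolated to the same CO2 concentration.
    """
    peak = co2.index(max(co2))
    up_x, up_y = co2[:peak + 1], uptake[:peak + 1]
    down_x, down_y = co2[peak + 1:], uptake[peak + 1:]
    if not down_x:
        raise ValueError("series has no ramp-down phase")
    gaps = [y - interp(x, up_x, up_y) for x, y in zip(down_x, down_y)]
    return sum(gaps) / len(gaps)

# Hypothetical pathway: CO2 rises to 500 ppm and returns to 400 ppm.
co2 = [400.0, 450.0, 500.0, 450.0, 400.0]
uptake_lagged = [0.0, 10.0, 20.0, 28.0, 34.0]      # uptake keeps accumulating
uptake_reversible = [0.0, 10.0, 20.0, 10.0, 0.0]   # path retraced exactly

gap_lagged = hysteresis_gap(co2, uptake_lagged)
gap_reversible = hysteresis_gap(co2, uptake_reversible)
```

A value of zero means the ramp-down retraces the ramp-up path, and a larger absolute value means a wider hysteresis loop. Applying the same function to SCM and ESM ensemble means would give a like-for-like comparison under this particular definition.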

3.5.6 Mixing of carbon in the ocean

The differences between ESMs and SCMs also emerge in the response of ocean carbon flux to the CO2 growth rate, CO2 concentration, and GSAT (Fig. 4). The increase in cumulative ocean carbon uptake with increasing CO2 growth rate, CO2 concentration, and GSAT diminishes in the ESMs but not in the SCMs. The discrepancies in the ocean carbon flux may be attributed to the nonlinearities of the carbon-concentration and carbon-climate feedbacks, i.e., the changes in carbon storage in response to changes in CO2 concentration and GSAT, respectively (Gregory et al. 2009; Schwinger and Tjiputra 2018; Melnikova et al. 2021). The response of the β feedback in the ocean is complex, with the change in CO2 growth rate dominating the flux variability on year-to-year timescales (i.e., system response to the forcing rate of change) and the change in CO2 concentration dominating the variability on decadal timescales (i.e., system response to the forcing magnitude) (Schwinger and Tjiputra 2018; Melnikova et al. 2021). Thus, while the ocean carbon storage response to the forcing rate of change is nearly equal between the two types of models, the storage response to the forcing magnitude is delayed in SCMs. We speculate that the discrepancy may be rooted in a faster mixing of carbon from the surface to the deep ocean in SCMs. Schwinger and Tjiputra (2018) showed that the carbon uptake of water masses of different ages exhibits varying degrees of hysteresis in response to changes in CO2 concentrations. The younger water masses show considerably less hysteresis in the carbon uptake than the deep ocean water masses. Similarly, under overshoot-like climate change mitigation scenarios, the ocean carbon uptake estimated by ESMs increases less with increasing GSAT and starts to decrease sooner with decreasing GSAT than the uptake estimated by SCMs. We recommend that the discrepancies in the response of ocean carbon uptake between ESMs and SCMs be further investigated using a set of idealized experiments, including those with abrupt CO2 increases and overshoot (ramp-up and ramp-down) scenarios.

4 Conclusion

This study investigates the differences in the carbon cycle projections calculated by ESMs and SCMs during the historical period and under future SSPs. First, we evaluate the models’ estimates of land and ocean carbon fluxes during the historical period. Second, we analyze the discrepancy in the future land and ocean carbon uptake estimated by ESMs and SCMs that emerges due to structural differences, as well as model outliers that impact ensemble means. Although the existing evidence did not allow us to thoroughly scrutinize the reasons for discrepancies in carbon cycle responses between ESMs and SCMs, we propose some likely explanations. To better align the carbon cycle projections between ESMs and SCMs, we put forward a set of recommendations regarding features of models that can be further developed: (1) mixing of carbon from the surface to the deep ocean; (2) future carbon-concentration feedback, influenced by nitrogen limitation of photosynthesis, and carbon-climate feedback; (3) representation of LUC emissions in the models; and (4) historical calibration of SCMs. Many carbon removal technologies are land-based and require land use in one form or another. In a world with decreasing FF emissions, how land is used becomes increasingly important, and it is crucial to improve the consistency between ESMs and SCMs with respect to LUC emissions.

Improving the carbon cycle features of SCMs will improve the applicability of SCMs as ESM emulators for both climate and carbon cycle projections and thus advance the developments of AR6. In particular, we highlight the deviations in the estimates of the carbon cycle dynamics by the SCM MAGICC, which is widely used to convert the IAM emissions to concentrations that are used by ESMs. Furthermore, we draw attention to the inconsistencies between the LUC emission estimates reported by ESMs and IAMs. These inconsistencies arise during the data conversion from IAMs to ESMs, as well as from the differences in the definitions of LUC emissions, which do not always account for the same processes and components. The differences in carbon flux estimates between SCMs and ESMs are particularly large for low CO2 concentration and overshoot scenarios. This has important implications, as these scenarios are relevant to policy analyses that are often supported by SCM simulations. We call for more joint SCM–ESM (and potentially IAM) studies and inter-model comparison exercises to explore future biogeochemical feedbacks related to land- and ocean-based mitigation options, pursue more consistent reporting of various carbon fluxes, and seek a higher consistency between the models used to generate scenarios.