Introduction

Soil organic matter is the largest terrestrial pool of actively-cycling carbon, with the potential for large carbon-feedbacks in response to climate and land-use change (Friedlingstein et al. 2014; Todd-Brown et al. 2014; Wieder et al. 2018). The storage and persistence of soil organic carbon (SOC) exhibit considerable spatial heterogeneity and are governed by multiple key factors, including climate, plant productivity, edaphic properties, and disturbance history (Doetterl et al. 2015; Jackson and Caldwell 1993; Jackson et al. 2017; Rasmussen et al. 2018; Sollins et al. 1996). However, it remains unclear which of these predictors dominate in distinct biomes and across spatial scales. Furthermore, as soil biogeochemical models continue to evolve, benchmarking techniques that validate not only the soil carbon stocks, but also the environmental sensitivity of soil carbon pools are increasingly needed (Sulman et al. 2018; Wieder et al. 2019a).

Soil biogeochemical models vary widely in structure and, consequently, in predicted carbon stocks and response to future perturbations (Abramoff et al. 2018; Georgiou et al. 2017; Li et al. 2014; Sulman et al. 2018). For decades, soil and ecosystem model development largely reflected traditional ideas of inherent chemical recalcitrance as a dominant control, and, consequently, the role of plant productivity and climate in regulating SOC storage (Schmidt et al. 2011; Wang et al. 2010). In recent years, however, emerging ideas place a growing emphasis on the importance of microbial activity and mineralogical properties (Cotrufo et al. 2013; Dwivedi et al. 2019; Grandy and Neff 2008; Lehmann et al. 2020) that have since been incorporated explicitly in process-based soil carbon models (Allison et al. 2010; Robertson et al. 2019; Sulman et al. 2014; Wang et al. 2013; Wieder et al. 2015a, b). Representing these evolving theories in mathematical models generates novel emergent behavior and alternative hypotheses about the relative importance of key state factors and their interactions. To adequately compare such models, it is important to recognize that there are numerous sources of uncertainty that can cause apparent downstream divergences (Sulman et al. 2018). Thus, it is helpful to force models with identical inputs and environmental conditions when comparing SOC stocks and their response to perturbations (e.g., within a model testbed; Wieder et al. 2018; Ahlström et al. 2017), to reduce sources of uncertainty that may confound the interpretation of results.

While the global distribution of SOC stocks may look reasonably similar across models and measurements (Fig. 1), it is difficult to diagnose where the models predict correctly and whether they do so for the right reasons. Indeed, recent studies that compare SOC stocks from models against long-term field manipulations have concluded that there are still large uncertainties in both the models and observations, and that novel observational syntheses and benchmarking techniques are needed (Sulman et al. 2018; Wieder et al. 2019b). Diagnosing model temperature and moisture sensitivities, for example, can be an informative exercise for parameterizing and improving model climate sensitivities (Abramoff et al. 2019; Koven et al. 2017; Wieder et al. 2018). To examine a broader suite of controls that includes soil mineralogy and plant productivity, however, requires novel diagnostics. We also note that global soil data products used for model comparisons and benchmarking are often statistically-derived themselves (i.e., extrapolated from point observations) and may not carry forward the same underlying relationships with key environmental controls (Batjes 2016; Hengl et al. 2014). Therefore, it is critical to explore the environmental controls of SOC stocks in soil profiles, in addition to the controls of SOC stocks in commonly used global observationally-derived data products and biogeochemical models. Furthermore, dominant predictors may vary geographically across biomes in observations and biogeochemical models.

Fig. 1
figure 1

Soil organic carbon observations and observationally-derived data products (a, b, c) and process-based soil model outputs (d, e, f) used in this study. Observations are from the a ISRIC WoSIS database soil profiles, b Harmonized World Soil Database (HWSD), and c Northern Circumpolar Soil Carbon Database (NCSCD) adjusted HWSD. Model outputs are from the d CASA-CNP, e MIMICS, and f CORPSE soil biogeochemical models. Soil organic carbon (SOC) stocks are shown for all globally gridded data products and soil model outputs

Here, we present a novel benchmarking approach that leverages machine-learning-based diagnostics. We explore key predictors of SOC stocks and how they differ between (i) soil profiles, (ii) observationally-derived data products, and (iii) biogeochemical models, globally and within biomes (Fig. 1). Specifically, we focus on climate, vegetation, and mineralogical controls that were present in all of the biogeochemical models compared in this study – namely, mean annual temperature (MAT), net primary production (NPP), and soil clay and silt content (i.e., soil texture; TEX). We anticipate that the predictability of SOC stocks with these select predictor variables will be lower for soil profiles and observationally-derived data products, compared to outputs derived from biogeochemical models that have an imposed underlying structure. Models represent SOC dynamics as a function of only a select few input variables, and consequently model output may depict a lower variability in SOC stocks under a given set of environmental conditions compared to field measurements. Indeed, while other environmental controls are likely important for explaining SOC variability in field measurements (e.g., pH, precipitation, fire, nutrients), they are not yet incorporated into all of the soil biogeochemical models presented herein and are thus not compared in this benchmarking analysis. Evaluating SOC stocks and their dominant controls in global scale models is a critical step towards improving the understanding, model representation, and projection of factors controlling soil carbon persistence and potential responses under environmental change. Given temporal limitations in broad-scale SOC measurements, novel diagnostics using spatial gradients present opportunities to explore environmental sensitivities and long-term responses to perturbations.

Methods

Global observations

In this study, we used two types of observational and observationally-derived datasets: (i) a compilation of soil profile measurements and (ii) observationally-derived gridded soil data products. Specifically, the ISRIC global soil database (World Soil Information System, WoSIS) was used for soil profile measurements and corresponding climate variables (n > 10,000 profiles) (Batjes 2009, 2016) (Fig. 1). For each soil profile, we summed SOC stocks to 1 m, and excluded profiles that were shallower than 80 cm, to best match the 1 m depth of the gridded soil data products and model outputs. For gridded soil data products, we used the Harmonized World Soil Database (HWSD) at 0.5° x 0.5° resolution (Wieder et al. 2014b), but also ensured that our results were robust to the selected resolution (Table 1). Given that HWSD SOC stocks may be biased at high latitudes because of fewer observations, we also used a combined gridded data product where the Northern Circumpolar Soil Carbon Database (NCSCD) was used to replace HWSD values where overlap occurs (Hugelius et al. 2013). In our analyses, this product is called the NCSCD-adjusted HWSD (Fig. 1).

For gridded observational data we used soil texture (TEX; i.e., clay and silt content) from the HWSD at 0.5° x 0.5° resolution. For the soil profiles, clay and silt content were reported directly in WoSIS and we used the corresponding profile averages in our analyses. Estimates of plant productivity were derived as 10-yr averages (2000–2010) from the MODIS net primary productivity (NPP) product at 0.5° x 0.5° resolution and at soil profile locations (Koven et al. 2017; Zhao et al. 2005). We used satellite-derived NPP estimates for all observational products in this analysis, but note that our findings were qualitatively insensitive to the choice of NPP, including using simulated NPP from the biogeochemical testbed forcing as used for the model output. Mean annual temperature (MAT) was estimated as a 10-yr average (2000–2010) from the CRU dataset at 0.5° x 0.5° resolution and at soil profile locations (Harris et al. 2014). For a land classification map, we used the MODIS MCD12C1 landcover product for year 2010 at 0.5° x 0.5° resolution (Friedl et al. 2010).

Models and simulations

We explored the primary controls of three global-scale soil carbon models: CASA-CNP (Carnegie-Ames-Stanford Approach model; Potter et al. 1993; Wang et al. 2010), MIMICS (MIcrobial-MIneralization Carbon Stabilization model; Wieder et al. 2015b), and CORPSE (Carbon, Organisms, Rhizosphere, and Protection in the Soil Environment model; Sulman et al. 2014) (Fig. 1; Table 1). These soil models form the foundation of the soil biogeochemical testbed (Wieder et al. 2018) and were chosen to represent different mechanistic representations actively used in global land models. We briefly describe the testbed and highlight features of each soil model, but refer readers to the papers above for more detailed descriptions regarding specific model assumptions, structures, and parameterizations.

Our analysis used soil carbon stocks that were simulated from soil models in the biogeochemical testbed. All three soil carbon models in the testbed were forced with identical inputs and environmental conditions, thereby isolating the effect of underlying structural (and associated parametric) uncertainty. The testbed simulations used daily air temperature, gross primary production, soil temperature, and soil moisture that were generated by the Community Land Model (CLM version 4.5). The CLM4.5 simulations used satellite phenology and atmospheric forcing from the CRU-NCEP climate reanalysis from 1901 to 2010 (see Oleson et al. 2013). These simulations generated globally-gridded, daily data that were needed to run the CASA-CNP vegetation model (Randerson et al. 1996; Wang et al. 2010). Although CASA-CNP can simulate coupled carbon, nitrogen, and phosphorus biogeochemistry, here we used the carbon-only version of the model. The vegetation model in CASA-CNP calculates autotrophic respiration fluxes, allocation of carbon to different plant tissues, and the timing of senescence and litterfall carbon inputs to the soil models (CASA-CNP, MIMICS, and CORPSE). Thus, in these carbon-only simulations, the soil models experienced identical timing and magnitude of litter inputs and soil abiotic conditions (temperature, moisture, and texture). Additional details for the simulation are described in (Wieder et al. 2018, 2019a).

Litterfall inputs provide fresh carbon substrates into litter pools that decompose into soil carbon pools in each model. CASA-CNP follows a conventional decomposition scheme that uses first-order decay of each carbon pool. In MIMICS and CORPSE, litter and soil decomposition occur when organic matter passes through one (CORPSE) or one of two (MIMICS) microbial biomass pools before (re)forming soil carbon, with rates of decomposition determined by substrate availability and the size of microbial biomass pools. Thus, MIMICS and CORPSE explicitly represent soil microbial activity and consider microbial interactions with the surrounding physicochemical soil environment. In all three models, the decomposition of organic matter follows temperature-sensitive kinetics, although the specific parametrization of each model results in distinct emergent temperature sensitivities of SOC turnover (see Wieder et al. 2018). All three models in the testbed assume that protected C pools are either inherently resistant to decomposition (e.g., long turnover times reflecting theories about the inherent chemical recalcitrance of passive C in CASA-CNP) or have restricted access to microbial decomposers (as in the protected pools simulated by MIMICS and CORPSE). Despite this distinction in theory, all three models in the testbed also use soil texture as a proxy that mediates the persistence of this passive or protected organic matter (Bailey et al. 2019; Rasmussen et al. 2018). Soil carbon stocks simulated by each model – CASA-CNP, MIMICS, and CORPSE – total 1380, 1420, and 1720 Pg C globally (0-100 cm depth; 10-yr average from 2000 to 2010), respectively, and, broadly, have similar spatial distributions (Fig. 1; Wieder et al. 2018).

Machine learning emulators and analyses

We used statistical modeling to identify key predictors influencing SOC variability and explored a suite of approaches, including multivariate linear regressions, gradient boosting machines, and random forests (Fig. S1). We trained the models using SOC content and corresponding predictors—here, MAT, NPP, and TEX—for each data source, globally and across land cover types. The random forest (RF) models are reported here, with results from the multivariate linear regressions shown in the supplement (Fig. S2; Table S1). For the RF results, the percent variance explained (on independent test data, with a 75 − 25 train-test split; Fig. S3-S5) and variable importance scores were averaged from an ensemble of 10 random forests (400 decision trees each) with bootstrapped sampling, which were sufficient for convergence and stable model results. All RF analyses were performed using the R package randomForest (version 4.6–12) (Breiman 2001; Liaw and Wiener 2002).

Variable importance scores depict the degradation in model performance, i.e., the increase in mean squared error (MSE), following the exclusion of a given predictor from the RF model and were normalized to sum to 1. Namely, in the case of two hypothetical predictors \({x}_{j}\; {\text{and}}\; {x}_{k}\), the variable importance (\(VI\)) of predictor \({x}_{k}\) can be written as

$${VI}_{k}=\frac{\varDelta {MSE}_{k}}{\varDelta {MSE}_{k}+\varDelta {MSE}_{j}}=\frac{MSE\left({x}_{j}\right)-MSE\left({x}_{j}, {x}_{k}\right)}{\left[MSE\left({x}_{j}\right)-MSE\left({x}_{j}, {x}_{k}\right)\right]+\left[MSE\left({x}_{k}\right)-MSE\left({x}_{j}, {x}_{k}\right)\right]}$$
(1)

and that of predictor \({x}_{j}\) can be written as

$${VI}_{j}=\frac{\varDelta {MSE}_{j}}{\varDelta {MSE}_{k}+\varDelta {MSE}_{j}}=\frac{MSE\left({x}_{k}\right)-MSE\left({x}_{j}, {x}_{k}\right)}{\left[MSE\left({x}_{j}\right)-MSE\left({x}_{j}, {x}_{k}\right)\right]+\left[MSE\left({x}_{k}\right)-MSE\left({x}_{j}, {x}_{k}\right)\right]}$$
(2)

where \(\varDelta {MSE}_{k}\) is the increase in mean squared error when \({x}_{k}\) is removed from the RF model compared to the model with all variables included, and analogously for \(\varDelta {MSE}_{j}\) with the removal of \({x}_{j}\). This increase in mean squared error when a given variable is removed means that, in the case of \({x}_{k}\) for example, \(\varDelta {MSE}_{k}>0\) since \(MSE\left({x}_{j}\right)>MSE\left({x}_{j}, {x}_{k}\right)\), where \(MSE\left({x}_{j}\right)\) is the mean squared error when \({x}_{k}\) is removed from the RF model and \(MSE\left({x}_{j},{x}_{k}\right)\) when all variables are included. The variable importance scores then sum to 1 across predictors; that is, \({VI}_{k}+{VI}_{j}=1\) in the example above. The importance ranking is linear, so it is preserved in the normalization. Importance scores reflect how important each predictor is in explaining the spatial variability of SOC for each data source, where a higher score signifies a greater importance (Figs. S6-S8).

Partial dependence relationships were used to explore the effect of each climatic and edaphic predictor variable on SOC stocks from each data source, while the other predictor variables were held constant at their mean value. These relationships emerged from the RF emulators, without imposing expected relationships a priori. For comparison, we also include results from a multivariate linear regression (Fig. S2; Table S1), which show similar qualitative relationships without the potential for emergent non-linearities. Standardized regression beta coefficients are given for the multivariate linear regressions from each data source, where the underlying data have been standardized so that the variances of dependent and independent variables are equal to 1. This allows the comparison of regression coefficients on the same scale. However, the empirical and modeled relationships are known to be non-linear for given predictors (e.g., with temperature), and thus we focus our study on these results and urge the adoption of methods that allow non-linearities to emerge.

Biome-specific analyses were conducted on a subset of the global datasets for each data source. Using the MODIS MCD12C1 landcover product (Friedl et al. 2010) for classification, we first grouped forest (deciduous/evergreen broad/needleleaf and mixed forests; i.e., IGBP land classes 1 to 5) and herbaceous (savannas and grasslands; i.e., IGBP land classes 9 and 10) biomes for broad landcover comparisons, with the HWSD and NCSCD-adjusted gridded data products and biogeochemical model output. We then explored underlying biomes within these broad categories (Fig. S6-S7). We excluded the soil profiles from these biome-specific analyses due to low predictability (variance explained < 20 %) and higher uncertainty in landcover classification, focusing instead on the gridded data products and biogeochemical model outputs (Fig. S6-S7). We trained random forest models on each data subset to learn the role of the underlying climatic and edaphic predictors on SOC stocks, and compared the percent variance explained and variable importance for each biome subset across the data sources.

Results and Discussion

Predictability and variance explained

The predictability of SOC varies substantially between observations and process-based models and across scales. Indeed, using random forest models with key state factors—namely, MAT, NPP, and TEX—to predict SOC from different sources showed a range of variance explained, from 26 % for global soil profile measurements up to 81 % for the MIMICS model (Table 1). The soil profile measurements were taken at the plot-level and exhibited greater heterogeneity due to fine-scale controls (e.g., proximity to vegetation, topographic wetness, soil aggregation; Wiesmeier et al. 2019) that are often not included in broad-scale analyses or models. As a result, the predictability of such SOC measurements is often lower (Doetterl et al. 2015; Hengl et al. 2014, 2017)—here, only 26 % of the variance in soil profiles is explained by MAT, NPP, and TEX using a random forest model, indicating that important controls at this plot-scale are still missing and that local conditions that drive SOC accumulation may not be well-matched with the relatively coarse spatial scales of MAT and NPP. Future studies should thus consider using finer-scale observationally-derived products of covariates, when feasible and available. Furthermore, other fine-scale controls (e.g., pH, exchangeable calcium, extractable iron and aluminum, among other properties) are important in explaining the variability in SOC across soil profiles (Heckman et al. 2020; Nave et al. 2021; Rasmussen et al. 2018), but these controls are not yet incorporated in most biogeochemical models and, therefore, could not be compared herein.

Table 1 Comparison of scales and predictability of soil organic carbon stocks from observations (soil profiles, HWSD, NCSCD adj. HWSD) and biogeochemical models (CASA-CNP, MIMICS, CORPSE)

In contrast, observationally-derived data products (e.g., HWSD and NCSCD-adj. HWSD) and process-based biogeochemical models have imposed underlying structures that rely on fewer input variables and are more easily learned by a random forest algorithm. Indeed, these gridded observational and model products exhibit less fine-scale heterogeneity in SOC stocks than that seen in soil profile measurements. Using a random forest model for each of the observational and biogeochemical model products globally, the same three covariates – MAT, NPP, and TEX – explained 58 and 53 % of the variance in the HWSD and NCSCD-adj. HWSD versus 63 %, 81 %, and 53 % in the CASA-CNP, MIMICS, and CORPSE model outputs, respectively (Table 1). The predictability of SOC also varied between individual biomes and broad biome classifications (Fig. S4-S5). For example, temperate deciduous broadleaf and evergreen needleleaf forests showed greater predictability than tropical evergreen broadleaf forests, in both observational data products and process-based models (Fig. S5).

While we focused our analyses on the more parsimonious RF model that included MAT, NPP, and TEX, we briefly note that adding mean annual precipitation (MAP) into the random forest models only marginally increased the percent variance explained by 1–5 % across the different data sources globally (Table S2). The smallest difference was observed for MIMICS, agreeing with the fact that the version of the model used in the biogeochemical testbed did not contain a soil moisture function and thus did not use MAP as an input (Wieder et al. 2018). Therefore, subsequent analyses focused on the environmental variables (MAT, NPP, TEX) that were direct inputs to all of the biogeochemical models.

Variable importance

In addition to overall SOC predictability across data and models, we also explored the variable importance and emergent relationships for MAT, NPP, and TEX for each of the SOC sources. We found an apparent mismatch between observations and biogeochemical models, where TEX was the most important predictor for the observations (both soil profiles and global datasets), in contrast to MAT and NPP for the model outputs (Fig. 2). This result also holds for clay as a predictor, as often used in biogeochemical models (Fig. S8; though only 48 % of the variance in HWSD was explained). Furthermore, we note that the importance of MAT was higher in the NCSCD-adj. HWSD compared to the HWSD alone, suggesting potential differences in underlying temperature sensitivity and freeze-thaw dynamics that are important to consider when benchmarking biogeochemical models globally.

Fig. 2
figure 2

Variable importance for soil organic carbon (SOC) stocks globally, using Random Forests (RFs) that learned the SOC from observations and biogeochemical models with net primary productivity (NPP), mean annual temperature (MAT), and soil texture (TEX). Results shown are averaged over an ensemble of RFs with bootstrapped sampling, and box plots depict the 95 % confidence intervals (error bars) and first and third quartiles (boxes). Variable importance equal to 1 indicates the highest importance

While biogeochemical models placed a higher importance on both temperature and plant productivity globally (Fig. 2), greater nuance exists at the biome-level. In our RF emulators of the biogeochemical model outputs, MAT stood out as the most important variable in forests, whereas NPP was most important in herbaceous biomes (Fig. 3). In contrast, TEX was again the most important variable for HWSD in both forest and herbaceous biomes (Fig. 3; Fig. S6), though the NCSCD-adj. HWSD showed an increased importance of MAT for herbaceous biomes at high-latitudes (Fig. 3; Fig. S6). The RF models performance in broad forest and herbaceous land classes were comparable to that of the global results (Fig. S4). Interestingly, among individual biomes where the RF also achieved similar performance to the global results (Fig. S5), we find agreement between the observations and model outputs in select biomes (Fig. S7). Namely, MAT had high explanatory power in temperate deciduous broadleaf and mixed forests and NPP had high explanatory power in temperate grasslands. This suggests that biogeochemical models can match observations in select biomes (e.g., temperate forests and grasslands), but targeted model improvements are needed in biomes where observations and models show divergent controls on SOC stocks (Fig. S6-S7). These targeted improvements could include a closer examination of model parameterizations in specific biomes (for example, increasing mineralogical controls or increasing/decreasing temperature dependencies) and the distribution of SOC among different model pools. Modifications may also require incorporating missing controls into biome-specific model formulations or conducting additional experiments and sampling campaigns in data-poor regions to further inform parameterizations.

Fig. 3
figure 3

Variable importance for soil organic carbon (SOC) stocks in a forest and b herbaceous biomes, using Random Forests (RFs) that learned the SOC from observations and biogeochemical models with net primary productivity (NPP), mean annual temperature (MAT), and soil texture (TEX). Forests included deciduous/evergreen broadleaf/needleleaf and mixed forests, and herbaceous biomes included grasslands and savannas. Results shown are averaged over an ensemble of RFs with bootstrapped sampling, and box plots depict the 95 % confidence intervals (error bars) and first and third quartiles (boxes). Variable importance equal to 1 indicates the highest importance

Predictive relationships

Mineralogical controls

Using random forest models trained on the observations and model output, we explored the dependence of measured and modeled SOC on individual covariates. These random forest models act as emulators of the observations and biogeochemical models, allowing us to explore individual relationships (i.e. partial dependence plots) while controlling for all other covariates. This method also allows for emergent non-linearities without a priori imposed relationships.

We observed stark differences between the observations and biogeochemical models (Fig. 4). Both the soil profiles and observational data products (HWSD and NCSCD-adj. HWSD) showed a greater dependence (steeper positive slope) on TEX (clay and silt content) compared to the models (near zero or negative slope; Fig. 4a; Fig. S2; Table S1). By contrast, the soil biogeochemical models were more sensitive to MAT and NPP, compared to the observational products. Though the soil biogeochemical models do contain some representation of mineral-organic associations (where MIMICS and CORPSE do so in a more explicit, mechanistic way), the importance of TEX appeared insignificant in explaining SOC content globally. Rather, climate and vegetation seemed to play a predominant role in driving the distribution of SOC in soil biogeochemical models. This discrepancy motivates a closer look at how mineralogical controls are implemented in each biogeochemical model and the resulting distributions of SOC stocks between modeled pools.

Fig. 4
figure 4

Partial dependence relationships of soil organic carbon (SOC) as a function of individual controlling factors, namely a soil texture (TEX), b mean annual temperature (MAT), and c net primary productivity (NPP). Machine-learning (Random Forest; RF) derived global relationships of individual covariates conditional on all other covariates being held constant at their mean. RFs were trained on each observational dataset and model output separately. Points show RF predictions; best fit lines show 99 % confidence intervals on the relationship

Temperature sensitivity

Both observations and models showed the highest SOC in grid cells with colder MAT (Fig. 4b). Indeed, SOC had the highest temperature sensitivity (steeper slope) in colder regions (MAT < 0 °C) and a lower temperature sensitivity (shallow slope) in warmer regions (MAT > 10 °C) (Fig. 4b), corroborating recent studies (Koven et al. 2017; Wieder et al. 2019a). This higher temperature sensitivity of SOC at low MAT primarily occurs in high-latitude forests, and therefore, a clear difference in temperature sensitivity was seen across herbaceous (here grasslands and savannas) and forest biomes (Fig. 3; Fig. S6). Without the NCSCD-adjusted values at high latitudes, the HWSD data product on its own showed a muted temperature sensitivity, far lower than that of the soil profiles and biogeochemical models (Fig. 4b), suggesting that the HWSD alone may not fully capture the underlying temperature sensitivity that is needed to benchmark biogeochemical models. Replacing the high latitudes with the NCSCD data product, known to be more accurate in those geographies (Hugelius et al. 2013; Koven et al. 2017), improved the emergent temperature sensitivity such that it approached that of the soil profiles.

While the soil biogeochemical models use similar temperature functions, their parameterizations and which fluxes or carbon pools they are applied to differ between the models (Wieder et al. 2018). As a result, the biogeochemical models exhibited a range of temperature sensitivities (Fig. 4b). The degree of temperature sensitivity of the soil profiles most closely matched the MIMICS and CASA-CNP models (Fig. 4b). CORPSE had the highest MAT dependence (Fig. 4b), which may be due in part to its ceased decomposition in frozen soils (Wieder et al. 2019a), and is reflected by its higher SOC content at high latitudes (Fig. 1).

Influence of plant productivity

Plant inputs are the main source of carbon to the soil and, thus, plant productivity across biomes plays an important role in carbon accumulation. We found that, in both the observations and soil biogeochemical models, SOC content exhibited a saturating relationship with increasing net primary productivity (NPP; Fig. 4c). However, the soil profiles and the biogeochemical models showed a significantly higher sensitivity (steeper increase) to NPP than the HWSD data product. This suggests that underlying environmental sensitivities seen in the soil profiles may not be effectively represented in the HWSD data product, and thus, care should be taken when using such data products to benchmark soil biogeochemical models. Indeed, the soil profiles showed a NPP sensitivity that fell in the middle of those seen in the models; specifically, at high NPP, the soil profiles showed a lower sensitivity than CORPSE and higher sensitivity than MIMICS and CASA-CNP, while at low NPP, the soil profiles were less sensitive than all three of the models (Fig. 4c). This qualitative difference in the shape of the soil profile curve is interesting, and further investigation with other datasets (especially those with site-level NPP measurements) is warranted for verification – especially given the low variance explained for the soil profiles – and to understand the root cause and potential implications for model representations. It is important to note that gridded NPP (and MAT) estimates may differ from the actual conditions experienced by the soil profiles being measured, as NPP can be highly variable at the site-level within a grid cell. Elucidating the sensitivity of long-term SOC storage to changes in plant inputs, and to underlying site-level heterogeneity, is critical for understanding potential feedbacks to land-use and land-cover change globally.

Conclusions and implications

Our results suggest a potential mismatch between observations and biogeochemical models in the emergent role of environmental controls in explaining SOC variability. Namely, MAT and NPP have the most explanatory power for SOC stocks from the biogeochemical models – CASA-CNP, MIMICS, and CORPSE – and TEX has the least explanatory power. In contrast, TEX has the most explanatory power for SOC stocks among the observations, both at the soil profile scale and in the global observationally-derived data products (Fig. 2; Fig. S8). This mismatch motivates a closer look at how mineralogical controls modulate SOC persistence and microbial function in biogeochemical models, and their resulting distribution of mineral-associated organic carbon, which could improve projections of not only SOC stocks, but also SOC ages and turnover times.

Furthermore, the covariate relationships observed in soil profiles and observationally-derived gridded data products were not always aligned (e.g., Fig. 4b-c), indicating that gridded data products may not effectively capture the underlying climatic and edaphic controls of the measured soil profiles due in part to upscaling uncertainty. The NCSCD-adj. HWSD, however, was more closely aligned with the soil profiles than was the HWSD alone, supporting recent studies that preferentially use it for model comparisons (Koven et al. 2017). Other gridded observational products (Hengl et al. 2017) can also be evaluated in this way to understand their effective representation of emergent environmental controls. This is an important step in validating observational data products for future use as model benchmarks.

The overall predictability of SOC stocks from soil profiles and observational data products was generally lower than that of the biogeochemical models, globally and within select biomes (Table 1; Figs. S4-S5). While other predictors are certainly important for explaining the variability of SOC in soil profiles and observational data products (Delgado-Baquerizo et al. 2017; Doetterl et al. 2015; Rasmussen et al. 2018), these predictors are often not direct inputs into biogeochemical model simulations and therefore cannot be compared using the random forest emulator approach presented here. However, further exploration of the underlying, biome-specific and global controls driving SOC variability in observations (e.g., metal oxides, nutrients, fauna, topography; also MAP in Table S2) (Crowther et al. 2019; Doetterl et al. 2015; Rasmussen et al. 2018) is essential for motivating their incorporation in future process-based formulations of soil biogeochemical models.