Global observations
In this study, we used two types of datasets: (i) a compilation of soil profile measurements and (ii) observationally derived gridded soil data products. Specifically, the ISRIC global soil database (World Soil Information System, WoSIS) was used for soil profile measurements and corresponding climate variables (n > 10,000 profiles) (Batjes 2009, 2016) (Fig. 1). For each soil profile, we summed SOC stocks to 1 m and excluded profiles shallower than 80 cm, to best match the 1 m depth of the gridded soil data products and model outputs. For gridded soil data products, we used the Harmonized World Soil Database (HWSD) at 0.5° × 0.5° resolution (Wieder et al. 2014b), and verified that our results were robust to the selected resolution (Table 1). Because HWSD SOC stocks may be biased at high latitudes, where observations are sparse, we also used a combined gridded data product in which values from the Northern Circumpolar Soil Carbon Database (NCSCD) replace HWSD values wherever the two products overlap (Hugelius et al. 2013). In our analyses, this product is called the NCSCD-adjusted HWSD (Fig. 1).
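The depth-harmonization step (summing SOC to 1 m and excluding shallow profiles) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the column names (`profile_id`, `top_cm`, `bottom_cm`, `soc_kg_m2`) are hypothetical placeholders, and horizons straddling 1 m are assumed to contribute proportionally to the fraction of their thickness above 1 m.

```python
import pandas as pd

def sum_soc_to_1m(horizons: pd.DataFrame, min_depth_cm: float = 80.0) -> pd.Series:
    """Sum per-horizon SOC stocks to 100 cm for each profile, excluding
    profiles whose deepest horizon is shallower than min_depth_cm."""
    h = horizons[horizons["top_cm"] < 100.0].copy()
    # Scale SOC in each horizon by the fraction of its thickness above 1 m.
    thickness = h["bottom_cm"] - h["top_cm"]
    capped = h["bottom_cm"].clip(upper=100.0) - h["top_cm"]
    h["soc_1m"] = h["soc_kg_m2"] * capped / thickness
    totals = h.groupby("profile_id")["soc_1m"].sum()
    # Exclude profiles shallower than the minimum depth (80 cm in the text).
    max_depth = horizons.groupby("profile_id")["bottom_cm"].max()
    return totals[max_depth[totals.index] >= min_depth_cm]
```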
For gridded observational data we used soil texture (TEX; i.e., clay and silt content) from the HWSD at 0.5° x 0.5° resolution. For the soil profiles, clay and silt content were reported directly in WoSIS and we used the corresponding profile averages in our analyses. Estimates of plant productivity were derived as 10-yr averages (2000–2010) from the MODIS net primary productivity (NPP) product at 0.5° x 0.5° resolution and at soil profile locations (Koven et al. 2017; Zhao et al. 2005). We used satellite-derived NPP estimates for all observational products in this analysis, but note that our findings were qualitatively insensitive to the choice of NPP, including using simulated NPP from the biogeochemical testbed forcing as used for the model output. Mean annual temperature (MAT) was estimated as a 10-yr average (2000–2010) from the CRU dataset at 0.5° x 0.5° resolution and at soil profile locations (Harris et al. 2014). For a land classification map, we used the MODIS MCD12C1 landcover product for year 2010 at 0.5° x 0.5° resolution (Friedl et al. 2010).
Models and simulations
We explored the primary controls of three global-scale soil carbon models: CASA-CNP (Carnegie-Ames-Stanford Approach model; Potter et al. 1993; Wang et al. 2010), MIMICS (MIcrobial-MIneralization Carbon Stabilization model; Wieder et al. 2015b), and CORPSE (Carbon, Organisms, Rhizosphere, and Protection in the Soil Environment model; Sulman et al. 2014) (Fig. 1; Table 1). These soil models form the foundation of the soil biogeochemical testbed (Wieder et al. 2018) and were chosen to represent different mechanistic representations actively used in global land models. We briefly describe the testbed and highlight features of each soil model, but refer readers to the papers above for more detailed descriptions regarding specific model assumptions, structures, and parameterizations.
Our analysis used soil carbon stocks that were simulated from soil models in the biogeochemical testbed. All three soil carbon models in the testbed were forced with identical inputs and environmental conditions, thereby isolating the effect of underlying structural (and associated parametric) uncertainty. The testbed simulations used daily air temperature, gross primary production, soil temperature, and soil moisture that were generated by the Community Land Model (CLM version 4.5). The CLM4.5 simulations used satellite phenology and atmospheric forcing from the CRU-NCEP climate reanalysis from 1901 to 2010 (see Oleson et al. 2013). These simulations generated globally-gridded, daily data that were needed to run the CASA-CNP vegetation model (Randerson et al. 1996; Wang et al. 2010). Although CASA-CNP can simulate coupled carbon, nitrogen, and phosphorus biogeochemistry, here we used the carbon-only version of the model. The vegetation model in CASA-CNP calculates autotrophic respiration fluxes, allocation of carbon to different plant tissues, and the timing of senescence and litterfall carbon inputs to the soil models (CASA-CNP, MIMICS, and CORPSE). Thus, in these carbon-only simulations, the soil models experienced identical timing and magnitude of litter inputs and soil abiotic conditions (temperature, moisture, and texture). Additional details for the simulations are described in Wieder et al. (2018, 2019a).
Litterfall inputs provide fresh carbon substrates into litter pools that decompose into soil carbon pools in each model. CASA-CNP follows a conventional decomposition scheme that uses first-order decay of each carbon pool. In MIMICS and CORPSE, litter and soil decomposition occur as organic matter passes through one (CORPSE) or one of two (MIMICS) microbial biomass pools before (re)forming soil carbon, with rates of decomposition determined by substrate availability and the size of microbial biomass pools. Thus, MIMICS and CORPSE explicitly represent soil microbial activity and consider microbial interactions with the surrounding physicochemical soil environment. In all three models, the decomposition of organic matter follows temperature-sensitive kinetics, although the specific parametrization of each model results in distinct emergent temperature sensitivities of SOC turnover (see Wieder et al. 2018). All three models in the testbed assume that protected C pools are either inherently resistant to decomposition (e.g., long turnover times reflecting theories about the inherent chemical recalcitrance of passive C in CASA-CNP) or physically inaccessible to microbial decomposers (as in the protected pools simulated by MIMICS and CORPSE). Despite this distinction in theory, all three models in the testbed also use soil texture as a proxy that mediates the persistence of this passive or protected organic matter (Bailey et al. 2019; Rasmussen et al. 2018). Soil carbon stocks simulated by each model (CASA-CNP, MIMICS, and CORPSE) total 1380, 1420, and 1720 Pg C globally (0–100 cm depth; 10-yr average from 2000 to 2010), respectively, and have broadly similar spatial distributions (Fig. 1; Wieder et al. 2018).
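The structural contrast between the conventional first-order scheme and the microbially explicit schemes can be illustrated with a minimal numerical sketch. The functional forms and parameter values below are generic placeholders (a single pool with linear decay versus a substrate-and-biomass-dependent Michaelis-Menten flux), not the actual parameterizations of CASA-CNP, MIMICS, or CORPSE.

```python
def first_order_step(C, inputs, k, dt=1.0):
    """Conventional (CASA-CNP-like) scheme: decay proportional to pool size.
    At steady state the pool equilibrates at C* = inputs / k."""
    return C + dt * (inputs - k * C)

def microbial_step(C, M, inputs, vmax, km, cue, death, dt=1.0):
    """Microbially explicit (MIMICS/CORPSE-like) scheme: decomposition
    depends on both substrate C and microbial biomass M (Michaelis-Menten).
    A fraction cue of decomposed C builds biomass; dead biomass recycles to C."""
    decomp = vmax * M * C / (km + C)
    dC = inputs - decomp + death * M
    dM = cue * decomp - death * M
    return C + dt * dC, M + dt * dM
```

The key structural difference is visible in the flux terms: in the first-order scheme the decomposition rate depends only on `C`, whereas in the microbial scheme it depends jointly on `C` and `M`, which is what makes SOC turnover an emergent property of microbial dynamics in MIMICS and CORPSE.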
Machine learning emulators and analyses
We used statistical modeling to identify key predictors of SOC variability, exploring a suite of approaches including multivariate linear regressions, gradient boosting machines, and random forests (Fig. S1). We trained the models using SOC content and corresponding predictors—here, MAT, NPP, and TEX—for each data source, globally and across land cover types. Results from the random forest (RF) models are reported here, with those from the multivariate linear regressions shown in the supplement (Fig. S2; Table S1). For the RF results, the percent variance explained (on independent test data, with a 75–25 train-test split; Figs. S3-S5) and variable importance scores were averaged over an ensemble of 10 random forests (400 decision trees each) with bootstrapped sampling, which was sufficient for convergence and stable model results. All RF analyses were performed using the R package randomForest (version 4.6-12) (Breiman 2001; Liaw and Wiener 2002).
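As a rough illustration of this ensemble procedure (the study itself used the R package randomForest), the scikit-learn sketch below trains 10 forests of 400 trees each on synthetic stand-ins for MAT, NPP, and TEX and averages held-out variance explained and importances. Note one deliberate substitution: scikit-learn reports impurity-based importances, not the permutation-style %IncMSE importance of R's randomForest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the predictors (MAT, NPP, TEX) and for SOC stocks.
X = rng.normal(size=(2000, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=2000)

# 75-25 train-test split, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Ensemble of 10 forests (400 trees each), each with a different random seed
# for its bootstrap sampling; average test-set R^2 and feature importances.
scores, importances = [], []
for seed in range(10):
    rf = RandomForestRegressor(n_estimators=400, random_state=seed).fit(X_tr, y_tr)
    scores.append(rf.score(X_te, y_te))          # variance explained on test data
    importances.append(rf.feature_importances_)  # impurity-based, sums to 1

mean_score = float(np.mean(scores))
mean_importance = np.mean(importances, axis=0)
```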
Variable importance scores depict the degradation in model performance, i.e., the increase in mean squared error (MSE), following the exclusion of a given predictor from the RF model and were normalized to sum to 1. Namely, in the case of two hypothetical predictors \({x}_{j}\; {\text{and}}\; {x}_{k}\), the variable importance (\(VI\)) of predictor \({x}_{k}\) can be written as
$${VI}_{k}=\frac{\varDelta {MSE}_{k}}{\varDelta {MSE}_{k}+\varDelta {MSE}_{j}}=\frac{MSE\left({x}_{j}\right)-MSE\left({x}_{j}, {x}_{k}\right)}{\left[MSE\left({x}_{j}\right)-MSE\left({x}_{j}, {x}_{k}\right)\right]+\left[MSE\left({x}_{k}\right)-MSE\left({x}_{j}, {x}_{k}\right)\right]}$$
(1)
and that of predictor \({x}_{j}\) can be written as
$${VI}_{j}=\frac{\varDelta {MSE}_{j}}{\varDelta {MSE}_{k}+\varDelta {MSE}_{j}}=\frac{MSE\left({x}_{k}\right)-MSE\left({x}_{j}, {x}_{k}\right)}{\left[MSE\left({x}_{j}\right)-MSE\left({x}_{j}, {x}_{k}\right)\right]+\left[MSE\left({x}_{k}\right)-MSE\left({x}_{j}, {x}_{k}\right)\right]}$$
(2)
where \(\varDelta {MSE}_{k}\) is the increase in mean squared error when \({x}_{k}\) is removed from the RF model relative to the model with all variables included, and analogously for \(\varDelta {MSE}_{j}\) with the removal of \({x}_{j}\). Removing a variable increases the mean squared error, so, in the case of \({x}_{k}\) for example, \(\varDelta {MSE}_{k}>0\) since \(MSE\left({x}_{j}\right)>MSE\left({x}_{j}, {x}_{k}\right)\), where \(MSE\left({x}_{j}\right)\) is the mean squared error when \({x}_{k}\) is removed from the RF model and \(MSE\left({x}_{j},{x}_{k}\right)\) when all variables are included. The variable importance scores sum to 1 across predictors; that is, \({VI}_{k}+{VI}_{j}=1\) in the example above. Because the normalization is linear, the ranking of predictors is preserved. Importance scores thus reflect how much each predictor contributes to explaining the spatial variability of SOC for each data source, with higher scores signifying greater importance (Figs. S6-S8).
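This normalization generalizes directly to any number of predictors; a minimal sketch (the function name and dictionary interface are illustrative, not from the study's code):

```python
def normalized_importance(mse_full, mse_without):
    """Normalized variable importance, as in Eqs. (1)-(2).

    mse_full    : MSE of the RF model with all predictors included.
    mse_without : dict mapping each predictor name to the MSE of the model
                  refit with that predictor excluded.
    Returns a dict of importance scores that sum to 1.
    """
    # Delta MSE for each predictor: degradation from excluding it.
    delta = {p: m - mse_full for p, m in mse_without.items()}
    total = sum(delta.values())
    return {p: d / total for p, d in delta.items()}
```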
Partial dependence relationships were used to explore the effect of each climatic and edaphic predictor variable on SOC stocks from each data source, while the other predictor variables were held constant at their mean values. These relationships emerged from the RF emulators without imposing expected relationships a priori. For comparison, we also include results from a multivariate linear regression (Fig. S2; Table S1), which show similar qualitative relationships but cannot capture emergent non-linearities. Standardized regression beta coefficients are given for the multivariate linear regressions from each data source; the underlying data were standardized so that the variances of the dependent and independent variables equal 1, allowing regression coefficients to be compared on the same scale. However, because the empirical and modeled relationships are known to be non-linear for some predictors (e.g., temperature), we focus our study on the RF results and urge the adoption of methods that allow non-linearities to emerge.
Biome-specific analyses were conducted on a subset of the global datasets for each data source. Using the MODIS MCD12C1 landcover product (Friedl et al. 2010) for classification, we first grouped forest (deciduous/evergreen broad/needleleaf and mixed forests; i.e., IGBP land classes 1 to 5) and herbaceous (savannas and grasslands; i.e., IGBP land classes 9 and 10) biomes for broad landcover comparisons, with the HWSD and NCSCD-adjusted gridded data products and biogeochemical model output. We then explored underlying biomes within these broad categories (Fig. S6-S7). We excluded the soil profiles from these biome-specific analyses due to low predictability (variance explained < 20 %) and higher uncertainty in landcover classification, focusing instead on the gridded data products and biogeochemical model outputs (Fig. S6-S7). We trained random forest models on each data subset to quantify the influence of the underlying climatic and edaphic predictors on SOC stocks, and compared the percent variance explained and variable importance for each biome subset across the data sources.
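The broad land-cover grouping amounts to a simple mapping from IGBP class codes; a minimal sketch (the function name is a hypothetical illustration, and classes outside the two groups are excluded, as in the text):

```python
from typing import Optional

# IGBP classes 1-5: evergreen/deciduous needleleaf and broadleaf forests,
# plus mixed forests. Classes 9-10: savannas and grasslands.
FOREST_CLASSES = {1, 2, 3, 4, 5}
HERBACEOUS_CLASSES = {9, 10}

def broad_biome(igbp_class: int) -> Optional[str]:
    """Map a MODIS MCD12C1 IGBP class code to the broad biome grouping
    used in the biome-specific analyses; other classes are excluded."""
    if igbp_class in FOREST_CLASSES:
        return "forest"
    if igbp_class in HERBACEOUS_CLASSES:
        return "herbaceous"
    return None  # excluded from the broad forest/herbaceous comparison
```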