Soil organic carbon fractions in the Great Plains of the United States: an application of mid-infrared spectroscopy

Spectroscopy is a powerful means of increasing the availability of soil data necessary for understanding carbon cycling in a changing world. Here, we develop a calibration transfer methodology to appropriately apply an existing mid infrared (MIR) spectral library with analyte data on the distribution of soil organic carbon (SOC) into particulate (POC), mineral-associated (MAOC), and pyrogenic (PyC) forms to nearly 8000 soil samples collected in the Great Plains ecoregion of the United States. We then use this SOC fraction database in combination with a machine learning-based predictive soil mapping approach to explore the controls on the distribution of fractions through soil profiles and across the region. The relative abundance of each fraction had unique depth distribution profiles with POC fraction dropping exponentially with depth, the MAOC fraction having a broad distribution with a maxima at 35–50 cm, and the PyC fraction showed a slight subsurface maxima (10–20 cm) and then a steady decline with increasing depth. Within the Great Plains ecoregion, clay content was a strong control on the total amount and relative proportion of each fraction in both the surface and subsoil horizons. Sandy soils and soils in cool semi-arid regions contained significantly more POC relative to the MAOC and PyC fractions. Cultivated soils had significantly less SOC than grassland soils with losses following a predictable pattern: POC > MAOC ≫ PyC. This SOC fraction database and resulting maps can now form the basis for improved representation of SOC dynamics in biogeochemical models.

The relative abundance of each fraction had unique depth distribution profiles with POC fraction dropping exponentially with depth, the MAOC fraction having a broad distribution with a maxima at 35-50 cm, and the PyC fraction showed a slight subsurface maxima (10-20 cm) and then a steady decline with increasing depth. Within the Great Plains ecoregion, clay content was a strong control on the total amount and relative proportion of each fraction in both the surface and subsoil horizons. Sandy soils and soils in cool semiarid regions contained significantly more POC relative to the MAOC and PyC fractions. Cultivated soils had significantly less SOC than grassland soils with losses following a predictable pattern: POC [ MAOC ) PyC. This SOC fraction database and resulting maps can now form the basis for improved Introduction Soil organic carbon (SOC) is a large and not very well constrained pool in carbon cycle models. Earth System Models (ESM) included in the CMIP5 study estimated global SOC stocks to range from 510 to 3040 PgC (Todd-Brown et al. 2013) with estimated changes under a high climate forcing scenario ranging from a loss of 72 PgC to a gain of 253 PgC over the twentyfirst century (Todd-Brown et al. 2014). While there are numerous reasons for model-data disagreement, the lack of spatially comprehensive data on soil properties relevant to carbon stabilization and loss was recently cited as one of the most pressing research needs to improve SOC dynamics in ESMs (Luo et al. 2016). In particular, Luo et al. (2016) reported that developing databases with soil carbon pool information was the most commonly cited data limitation to ESM improvement.
Given that SOC is composed of a diverse set of organic compounds with greatly differing susceptibilities to microbial decomposition (Schmidt et al. 2011;Lehmann and Kleber 2015), knowledge of only the amount of SOM is insufficient to understand SOC response to environmental and anthropogenic disturbance (Lavallee et al. 2020). Lavallee et al. (2020) proposed that the division of SOC into particulate and mineral-associated organic carbon (POC and MAOC) results in two fundamentally different pools of SOC. POC is composed primarily of recognizable fragments of plant detritus with typical residence times of \ 5 years and can vary from only a few percent to upwards of 50% of total SOM depending on soil texture and management history (Cambardella and Elliott 1992;Baldock et al. 2013a). POC likely provides for much of the energy demands of the soil microbial community but its role in nutrient provision will vary depending on carbon-to-nutrient ratios (Janzen et al. 1992;Gregorich et al. 2006). Additionally, POC is considered to be a nexus of aggregate formation (Six et al. 2004), especially in sandy-textured soils, and therefore can contribute significantly, albeit indirectly, to various soil structural properties. On the other hand, MAOC typically makes up the majority of SOC (von Lützow et al. 2007;Baldock et al. 2013b) and is composed primarily of highly decomposed material, often microbial in origin (Miltner et al. 2012) and is stabilized by association with reactive mineral surfaces (Schmidt et al. 2011). MAOC with typical low carbon-to-nutrient ratios represents the major reservoir of potentially plant-available nutrients (Cambardella and Elliott 1994), contributes the majority of the negative charge attributable to SOC (Kleber and Johnson 2010), and can be an important binding agent between clay particles. Unfortunately, methodological approaches to dividing SOC into POC and MAOC by density separation or size fractionation fail to recognize that upwards of 20-50% of the total SOC pool can be composed of combustion residues of wild fires (Bird et al. 2015;Reisser et al. 2016). This pyrogenic carbon (PyC) is found distributed across both the POC and MAOC fractions (Baldock et al. 2013a;Leifeld et al. 2015). The PyC fraction has low nutritive value (Deluca et al. 2015), low energy yield and is thus slowly degraded (Liang et al. 2008), but has two important physical properties: significant microporosity and high negative charge (Kookana et al. 2011). The microporosity of PyC helps provide important habitat and refugia for soil microbes leading to greater functional diversity (Atkinson et al. 2010;Lehmann et al. 2011); while the negative charge has been shown to contribute greatly to cation exchange capacity (Liang et al. 2006), plant-water availability (Basso et al. 2013) and to reduced nutrient leaching (Yao et al. 2012).
Measuring SOC fractions clearly provide critical information on soil biogeochemical cycles but the work is extremely laborious especially when the PyC pool is also isolated (Baldock et al. 2013b). Diffuse reflectance mid infrared (MIR) spectroscopy has been used to successfully predict the distribution of SOC into fractions (Zimmermann et al. 2007;Baldock et al. 2013a;Knox et al. 2015) thereby greatly increasing the spatial coverage of SOC fraction data. Ahmed et al. (2017) used the Baldock et al. (2013b) models to explore the variation in POC and PyC across four western US states. Viscarra Rossel et al. (2019) also took advantage of the ability to model SOC fractions from MIR spectra and applied the calibration models developed by Baldock et al. (2013b) to ca. 6000 samples with MIR spectra to develop an understanding of how SOC fractions vary across the Australian continent.
As part of a larger investigation into land use and climate impacts on carbon cycling in the Great Plains ecoregion of the United States, here we present an application of efficient knowledge transfer using MIR spectroscopy and digital soil mapping in order to explore trends in the fractional allocation of SOC into particulate, mineral-associated and pyrogenic carbon pools across this broad geographic area. First, we document the spectroscopic and statistical developments necessary for applying existing carbon fraction models to a different soil MIR spectral library. In particular, we focus on calibration transfer between FTIR instruments and extension of the training set through spiking with locally-representative samples. Second, we apply these spectroscopy model developments to understand the variation in SOC fractions across the Great Plains ecoregion of the United States.

Site and data
The Great Plains ecoregion (Omernik and Griffith 2014) of the United States, covering almost 2.1 M km 2 across 11 States, was selected because of the strong climatic gradients (Table S2) but small variation in soil parent material and native vegetation and the mixture of land uses. With the exception of some localized river deposits, most soils in the Great Plains have formed on deep eolian and loess deposits or glacial till left behind as the Laurentide ice sheet retreated 12,000 years ago. Native vegetation in the Great Plains was dominated by prairie grasses trending from tallgrass to shortgrass assemblages along increasing aridity gradients from east to west. Most of the former tallgrass prairie has been converted to cropland over the past 150 years with corn, soy and wheat being the three dominant crops of the region. Much of the mixed-grass and shortgrass prairie is grazed by livestock. The USDA National Soil Survey Center-Kellogg Soil Survey Laboratory (NSSC-KSSL) has collected MIR spectra and associated laboratory data on nearly 14,000 soil samples from this region. About 8500 of these samples from 1565 soil profiles had associated geolocation data (Fig. S1).
The Australian soil samples and spectra available for SOC fraction model building were from a national survey of Australian agricultural soils reported in Baldock et al. (2013a, b) plus an additional 253 soil samples that were analyzed subsequent to those publications. These 565 soil samples had undergone a full fractionation procedure, outlined below, to determine the content of particulate, mineral associated and resistant carbon fractions.

Carbon fractionation
In total 99 new samples were fractionated using the full fractionation protocol as described in Baldock et al. (2013a). These samples were selected to represent the spectral diversity expected to be encountered across the Great Plains ecoregion of the United States. Specifically, a Principal Components Analysis was conducted on baseline transformed MIR spectra of all samples from the Great Plains ecoregion contained in the NSSC-KSSL spectral library. Then, a set of 100 samples were selected to best represent the diversity in the full dataset using the Kennard-Stone algorithm (Kennard and Stone 1969). The final set of samples comes from 11 states spanning 18 degrees in latitude and 20 degrees in longitude (Fig. S1). The median mid-point sample depth was 38 cm (25-75% quartile = 10-71 cm), median SOC and inorganic carbon (IC) concentrations were 0.9% (25-75% quartile = 0.5-2.3%) and 0.5% (25-75% quartile = 0-1.8%), respectively.
The fractionation procedure (Baldock et al. 2013a) separates fine earth (\ 2 mm) total organic carbon (TOC) into a particulate (POC) fraction, defined as organic material [ 50 lm excluding char-like material, a mineral-associated (MAOC) fraction, defined as organic material \ 50 lm excluding char-like material, and a pyrogenic (PyC) fraction defined as the organic material consisting of poly-condensed aromatic carbon determined by solid-state C-13 nuclear magnetic resonance (NMR) spectroscopy. We have adopted the MAOC terminology here while acknowledging there is no way to know if SOC in the fine size fraction is in fact mineral associated and most recent studies using the term MAOC have not distinguished between biogenic and pyrogenic C within this fraction. In brief, 10 g of the fine earth fraction was dispersed in a 5 g L -1 sodium hexametaphosphate solution, then separated into a coarse and fine size fraction using an automated wet sieve shaker (Fritsch Analysette 3 Spartan). Size fractions were freeze dried and organic carbon was determined on each size fraction by elemental analysis (Elementar Vario Max Cube) after removal of carbonates using sulfurous acid as necessary. Then, the size fractionation was repeated on 10-50 g of fine earth fraction depending on carbon content of fractions. These second size fractions were prepared for solid-state C-13 NMR spectroscopy by removal of minerals. For the coarse size fraction, the organics were floated off the sand in a large evaporating dish. For the fine size fraction, dilute (2%) hydrofluoric acid was used for demineralization and removal of paramagnetic interferences following the protocol of Skjemstad et al. (1994). NMR analyses were performed on the size fractions using a Bruker Avance 200 spectrometer equipped with a 4.7 T widebore superconducting magnet allowing the use of 7 mm diameter rotors which could contain up to 400 mg of sample. Cross polarization (CP) experimental parameters were optimized for each sample as described in Baldock et al. (2013a).
Not enough coarse fraction organic material was recovered for NMR on nine samples. Another 13 coarse fractions had unacceptably low signal from the CP NMR experiments so these NMR data were excluded from further analysis. For these 22 samples, we assumed that there was no resistant carbon in the coarse fraction. This is a conservative assumption and will not impact much upon the carbon mass balance since for these samples the great majority of carbon was found in the fine fraction. In fact, for these 22 samples, 91% of TOC was found, on average, in fine fraction. Good quality NMR spectra were acquired on all 99 fine fraction samples (Fig. S2). On an additional subset of fine (n = 20) and coarse (n = 7) fraction samples, direct polarization (DP) experiments were also performed to improve the relationship quantifying the underrepresentation of PyC in CP experiments (Eq. 12 in Baldock et al. 2013a). After combining these new DP data with existing Australian DP data on fractions, the new best fit for CP observability (/ CP ) was 0.552 (/ CP = 0.510 in Baldock et al. 2013a) ( Figure S3). This new / CP value was then applied to all US and Australian fraction data to come up with final carbon concentrations in fractions.

MIR measurement, model development and prediction
All Australian samples were finely milled and analyzed on a Thermo Nicolet 6700 FTIR spectrometer with a Pike autodiff diffuse reflectance accessory. Sixty co-added scans were acquired from 8000 to 400 cm -1 at a resolution of 8 cm -1 . Full operating conditions can be found in Baldock et al. (2013b). The 99 samples from the Great Plains were also analyzed using this spectrometer. The Great Plains samples, after milling under similar conditions, were all scanned on a Bruker Vertex 70 with a HTS-XT high throughput diffuse reflectance accessory at the KSSL in Lincoln NE. At the KSSL, all samples were scanned in quadruplicate with spectra recorded from 6000 to 400 cm -1 . Details can be found in Dangal et al. (2019). A set of 285 Australian soils (out of the 565 available with fraction data) were also scanned at the KSSL.
A memory-based learning (MBL) model developed by Ramirez-Lopez et al. (2013) as modified by Dangal et al. (2019) was used for all chemometrics. The MBL is a local or just-in-time modeling approach where a small number of most spectrally similar samples are chosen from the training set to build a model appropriate for each sample to be predicted. This parsimonious modeling approach eliminates the need for a priori subsetting of the training data by soil type or depth or any other factor and has been shown to outperform other modeling approaches on large spectral libraries (Dangal et al. 2019). In this study, we used a local partial least square regression (L-PLSR), where a target sample is predicted by performing the local cross validation against the user specified minimum (n = 10) and maximum (n = 100) number of spectrally similar neighbors. The spectrally similar neighbors corresponding to the target sample were selected using the Mahalanobis dissimilarity computed on the principal component space. We use local L-PLSR because the only parameter that needs to be optimized is the number of principal components, which is done by performing cross-validation at each local segment. Spectra were baseline transformed prior to model fitting. All fraction data were square root transformed prior to modeling but back-transformed before reporting goodness-of-fit statistics. Prior to model building an outlier screening step (Dangal et al. 2019) was employed where a PLSR model was developed to predict each fraction using all 585 ? 99 samples scanned at CSIRO and extreme outliers from the 1:1 line were removed. Five of the 99 Great Plains samples were identified as outliers in all models and thus excluded from further model development. Given the limited number of fully fractionated new Great Plains soil samples, models were evaluated using leave-one-out (LOO) cross validation instead of using an independent test set for validation.
First, we tested whether or not the original Australian training set could make accurate predictions of carbon fractions on the US soils. Next, we tested whether or not a model from just the newly fractionated US samples would perform well. Since there were only 94 samples, we applied a partial least squares regression model for each carbon fraction. Finally, the 94 fractionated soils from the Great Plains ecoregion were added to a database of 565 Australian soils that have undergone the full fractionation procedure. Using the MBL model with the 565 Australian soils as the training set, we found variable success in predicting the measured fractions on the 94 GP samples (Table 1). After spiking the training set with the GP samples, predictions of all fractions improved greatly suggesting (Table 1) that the inclusion of the local samples is necessary for accurate predictions. Additionally, for all three fractions, the spiked model outperformed the US-only model reducing the RMSE by an average of 25% (Table 1).
The second technical hurdle that needed to be addressed in order to use the Australian training set was to account for the fact that the NSSC-KSSL spectral library was acquired on a different instrument than the fraction model training set. The relative spectral response (y-axis) is known to vary slightly between optical benches for MIR measurements and various procedures have been proposed for transferring calibration models between instruments (Ge et al. 2011). Using the 285 AU and 94 US samples that were scanned on both instruments, a calibration transfer matrix was developed on first-derivative transformed spectra using the Piecewise Direct Standardization (PDS) approach (Bouveresse et al. 1996). Optimal transfer was achieved with a window size of three and retention of only one principal component. Table 2 indicates that good (i.e. high R 2 ) but slightly biased fits could be achieved without PDS, and that PDS always improved model fit with lower RMSE values for all carbon fractions.
Based on these results, we moved ahead with predictions on all GP samples that included GPS coordinates (n = 8500) using the MBL model on PDStransformed MIR spectra. Given we are now applying this model to a large number of samples without any further validation data, we needed some ways of detecting potential outliers. First, we employed an F-ratio test (Hicks et al. 2015) that identified spectra that fell outside of 95% of the calibration space. These samples may predict well but we would have little confidence in the predictions because they are unique to our calibration set. This screening resulted in the exclusion of 164 samples. For comparison, when the US-only model was applied to these 8500 samples, 571 samples were identified as F-ratio outliers.
Next, we compared the sum of the predicted fractions (POC ? MAOC ? PyC) to the measured TOC values. From this plot (Fig S6), it became clear that our predictions broke down for samples with greater than 100 gC kg -1 coinciding with the highest measured TOC value in our calibration set. These high SOC samples (n = 488) were then excluded from further analysis. A small number of samples (n = 72) showed extreme mismatches between the sum of fractions and observed TOC data. These mismatches may have arisen from laboratory errors in analyzing the wrong sample for TOC or FTIR analysis or were just really poorly predicted. Either way, these 72 samples would have too much leverage on model results and have thus been excluded. After removing these 724 outliers from the 8500 total samples, the predicted sum of fractions showed an excellent correspondence with measured TOC data (Fig. S6).  Table S1. Covariate layers were downscaled to 250 m, and transformed to a common projection prior to stacking rasters for random forest analysis. Any missing data were gap-filled using gdal_fillnodata.py. Basic data exploration included plotting trends with depth and subsetting data into surface (0-20 cm) and subsurface (40-70 cm) groups to explore univariate relationships with covariates. Fixed depths were chosen because only 45% of the samples had information on genetic horizons. For the samples that had information on genetic horizon, 96% of the 0-20 cm samples were designated as A horizon and 92% of the 40-70 cm samples were designated as B horizon. Histograms of the maximum of fPOC, fMAOC and fPyC for each soil profile were constructed on full soil profiles (n = 1398) to better visualize the depth distribution of carbon fractions. Linear, semi-logarithmic (log-lin and lin-log) and power (log-log) regression models were applied to depth trends for all fraction data and for trends with clay for fraction data. Differences between major land cover classes (cropland and grassland) was assessed for each fraction using one-way ANOVA after square root transformation of data.
To further explore the non-linear multivariate relationships that are known to control soil carbon distribution at larger spatial scales, quantile random forest (QRF) models (Meinshausen 2006) were built for TOC and all fractions. In order to find optimal tuning parameters in the QRF models a fivefold cross validation procedure repeated 100 times for each of the separate models was implemented. For each model, 500 trees were used and a search grid for the number of variables sampled at each tree split was searched across of a grid of 5, 10, 20, 30, 40 and 50. Within this cross-validation procedure the model which resulted in the largest R 2 was selected. After optimal tuning parameters were selected, the final model performance was evaluated by randomly sampling 80% of the field sites used for model training, and 20% of the sites used for model validation. Lastly, across each of the eight dependent variables modeled, spatial predictions were acquired at seven different soil depths (1, 3.5, 7.5, 15, 25, 37.5, and 67.5 cm), and at each depth three quantiles were predicted to get a measure of per pixel level uncertainty corresponding to 2.5, 50 and 97.5% confidence.

Depth trends
Combining all site data, there was a strong decrease in concentration of each fraction (Fig. 1a-c) with POC decreasing relatively faster than MAOC and PyC (Fig. 1d, e). Power functions best described the decrease in POC (R 2 = 0.42) and fPOC (R 2 = 0.19) with depth, while log-lin functions best described the relationship between MAOC (R 2 = 0.32), fMAOC (R 2 = 0.11) and PyC (R 2 = 0.21) with depth. When all sites were grouped together, there was no trend in fPyC with depth. The maximum proportion of each fraction was found at different depths within the soil (Fig. 2). The POC fraction peaked in the topsoil (0-10 cm), whereas, the MAOC fraction peaked around 30-50 cm. While masked when all the sites are grouped together (i.e. Fig. 1), fPyC displayed, on average, a slight subsurface maxima around 10-20 cm when examining profiles individually (Fig. 2).

Relationships with covariates
In order to explore univariate and multivariate relationships with covariates, data were grouped into surface (0-20 cm, n = 2457) and subsurface (40-70 cm, n = 1512) categories to minimize the overall impact of depth on soil properties. Concentration of SOC and fractions of SOC were correlated with climate (Table S2) with lowest concentrations in warm, dry environments (negative correlations with temperature and solar radiation) and highest concentrations in cooler, wetter locations (strong positive association with cloud fraction and snowfall). Correlations with climatic variables were generally much stronger for the surface than the subsurface horizon. The MAOC and PyC fractions were also positively associated with clay content.
The amount (Table S2) and proportional distribution of OC into fractions (Fig. 3) were correlated with clay content. The sandiest samples, regardless of depth, had disproportionately more POC than MAOC or PyC (Fig. 3). Soil pH was also correlated with the fractional distribution of OC with fPyC increasing with increasing pH (Table S3) but this might simply be a reflection of drier climates where pH would be higher. Correlations with edaphic properties were stronger in surface than subsurface (Table S3). Clay and pH were only weakly correlated (data not shown).
Warmer and wetter climate lead to relatively more MAOC and less PyC but POC tended to be independent of climate (Table S3). In contrast to the findings for carbon concentration, climate-driven trends with the proportional distribution of fraction were stronger in the subsurface than surface horizons. Perhaps due to the large regional scale of this investigation, topographic properties were poor predictors (Tables S2,  S3).
The concentration of SOC and fractions in the topsoil (0-20 cm) varied by land cover classification (Fig. 4a). Grasslands had the highest concentrations of SOC and all fractions with croplands having the lowest. When all soils were pooled together, there was a difference in POC (F = 7.5, P = 0.006), a smaller difference in MAOC (F = 2.9, P = 0.09) and no difference in PyC. The relative difference between land cover classes was more dramatic when the analysis was limited by soil type with all three fractions showing highly significant differences (P \ 0.001). For the most common Great Group (Argiustolls, n = 650), there was 33% less POC, 29% less MAOC and 25% less PyC in cropland than grassland samples (Fig. 4b). Given the variability across this large region, the tendency for a preferential loss of POC over other fractions did not change the average distribution of the fractions (data not shown).

Random forest modeling
The spatial distribution of SOC and carbon fractions were well predicted using a quantile random forest model applied to a stack of spatially continuous co-variate layers. Models produced minimal bias, a slope close to unity, R 2 values between 0.66 and 0.76 and RPD values around 2.3 for all properties (Table 3). Sampling depth was always the most important variable in the QRF models (Fig. S8). As a singular variable, sampling depth explained 43, 32, and 21% of the variance in POC, MAOC and PyC, respectively (Fig. 1). After accounting for sampling depth, the importance of the remaining variables varied depending on fraction. Averaged across months, temperature was the second most important variable for POC, whereas MODIS cloud fraction and solar radiation were the second and third most important variables for the MAOC and PyC fractions. Vapor pressure deficit was an important variable for POC but not MAOC or PyC. Monthly precipitation consistently ranked as some of the least important variables. Unfortunately, land cover classification and soil classification could not be included in our random forest models because not all categories present across the Great Plains were represented in the training data.
After projecting the random forest model results across the Great Plains ecoregion, large variation in the stocks of the three fractions were observed (Fig. 5). Associated uncertainty is shown in Fig. S9. In general, there was a trend of increasing amount of all fractions moving from the hot arid southwest to the cool mesic northeast with some interesting exceptions. The cool and dry northwestern edge of the Great Plains contained nearly as much POC as the mesic northeast. The Sandhills region of central Nebraska also stood out as having disproportionally low MAOC and PyC stocks relative to POC. The Sandhills region also stands out as have much greater uncertainty (Fig. S9).

MIR spectroscopy
In this study, we took advantage of the ability to predict soil properties from easily obtained MIR spectra in order to greatly expand data availability and thus scope of inference (Nocita et al. 2015). In particular, we demonstrated that a moderately sized dataset (n = 565) of carbon fractions with appropriate processing steps could be successfully used to make predictions of carbon fractions on nearly 8000 soil samples from a different region of the world. Two different technical challenges needed to be addressed in the process of making these predictions.  First, the existing carbon fraction database came from Australian soils that were not necessarily representative of the types of soils encountered in the Great Plains ecoregion of the United States. In order to assess the magnitude of this challenge and provision data to resolve the problem, we fractionated 100 soil Fig. 4 Mean concentration of particulate (POC), mineral associated (MAOC) and pyrogenic (PyC) fractions grouped by land cover classification for all topsoil (0-20 cm) samples (a) and for the most common soil type (Argiustolls) (b). Error bars represent 2 standard errors. Significance differences in fractions between land cover classes given as: ns not significant, *P \ 0.1, **P \ 0.01, ***P \ 0.001  From the standpoint of carbon fraction content (Fig. S4) and distribution of SOC into fractions (Fig. S5), the two regions do not appear too different. Additionally, we compared the spectral data from the two regions in Principal Component space and see that there is good overlap (Fig. S7). Despite the similarities between the two sample sets, Table 1 indicated spiking of the Australia spectral dataset with the local Great Plains samples was necessary for really good unbiased predictions, particularly for the PyC fraction. Several other studies have also demonstrated that spiking with local samples is an effective method for applying spectral libraries developed outside the region of interest (Guerrero et al. 2016;Lobsey et al. 2017), although not necessarily for all soil properties (Gogé et al. 2014). Second, the MIR spectra of the Great Plains soil database were acquired on a different FTIR spectrometer necessitating the need to evaluate calibration transfer methods. Previous work has shown that large differences in the raw spectra for the same samples acquired on the different instruments are reduced following first derivative transformation, and that PDS successfully eliminate most of the variability between the primary and secondary spectra . In this work, we found that calibration transfer using piecewise direct standardization did not improve R 2 of predictions but was necessary to remove bias in predictions resulting in up to a 50% reduction in RMSE (Table 2). Whether or not calibration transfer is always necessary is still a largely unresolved issue in soil spectroscopy and it may depend upon the specific combination of instruments and soil properties to be predicted .
The comparison between the spiked Australian model and the US-only model illustrated that despite any additional uncertainties that may have been introduced through the calibration transfer process the larger training set outperformed the US-only model, primarily for POC and MAOC fractions, both in terms of internal validation and applicability to a wider range of samples when applied to the 8500 samples in the Great Plains database.

Carbon fractions
Knowledge of only the total stock of SOC is likely insufficient for understanding SOC response to perturbations and for designing and reporting on SOC sequestration programs. Carbon fractions can provide a much more nuanced understanding of SOC change (Lavalle et al. 2020;Viscarra Rossel et al. 2019). In general, the faster cycling POC fraction is expected to provide for much of the energy demand of the soil microbial community while the MAOC fraction, through interactions with reactive mineral surfaces, contributes the most to longer time SOC accrual. The PyC, being of pyrogenic origin, interacts with soil microorganisms in a fundamentally different way leading to much greater persistence (Schmidt et al. 2011). The SOC fraction data generated in this study of the Great Plains ecoregion of the United States reveal a number of insights in SOC dynamics that we detail in this section.

Depth trends
An advantage of working with the NSSC-KSSL database is that full soil profiles were collected at nearly all locations allowing for a detailed exploration of the vertical allocation of SOC into fractions. Plotting the proportion of each fraction against depth for all samples revealed a rapid decrease in fPOC, increase in fMAOC but no trend in fPyC (Fig. 1d, e). A different picture emerged when considering each profile as a set of related data (Fig. 2). The frequency distribution of the maximum value for fPOC, fMAOC and fPyC for each profile clearly indicated POC is concentrated in the topsoil (0-10 cm), PyC peaks slightly deeper (5-20 cm), and MAOC reaches a maximum in subsoil horizons (30-50 cm).
The distribution of SOC fractions with depth is revealing and affirms the importance of including vertical transport processes when modeling SOC dynamics (Baisden et al. 2002;Koven et al. 2013) even if only considering the top 20 or 30 cm of soil. The POC fraction is directly derived from recent plant inputs, whether surface deposition of aboveground litter or root turnover, which decreases exponentially with depth, and has a relatively short mean residence time. For most soils, where vertical transport is on the order of a 0.5-2.0 mm year -1 (Elzein and Balesdent 1995;Baisden et al. 2002;Kaste et al. 2007), this means that most POC is mineralized or transformed into more stable MAOC by microbial cycling before moving deeper in the soil profile. The co-location of fMAOC maximum with typical clay maxima can also be explained by the relatively rapid loss of POC, vertical transport of dissolved organic carbon and subsequent retention on mineral surfaces with a greatly reduced mean residence time (Sanderman and Amundson 2008;Kaiser and Kalbitz 2012).
The slight subsurface maxima for the PyC fraction has been seen in other studies where full profiles were analyzed (Hammes et al. 2008;Hobley et al. 2016;Wang et al. 2018) and can also be interpreted in terms of relative production, transport and consumption (Major et al. 2010). The PyC fraction being pyrogenic in origin enters the soil primarily on the surface following wildfire. While a lot of PyC may be transported horizontally (Abney et al. 2017), some PyC will get incorporated into the soil profile through slow mixing processes. Importantly, the PyC fraction is not completely inert and is slowly lost to microbial processing and dissolution on timescales of centuries (Hammes et al. 2008) allowing for transport to much greater depths than POC.
While we have emphasized trends in the relative distribution of fractions in this section, it is important to emphasize that our fractionation methodology has estimated that on average 24% of TOC is consistent with a pyrogenic origin (Fig. 1). Using similar NMR spectroscopy-based methodology, Skjemstad et al. (2002) found that 10-35% of TOC was PyC across a range of US soils. However, estimates of PyC using the benzene polycarboxylic acid (BPCA) biomarker technique and hydrogen pyrolysis have fallen in the range of 4-18% for similar soils (Glaser and Amelung 2003;Lavallee et al. 2019). This discrepancy is to be expected as the NMR technique measures more of the PyC continuum than either of the other two methods (Reisser et al. 2016;Schmidt et al. 2001).
The role of clay Rasmussen et al. (2018), synthesizing a broader selection of data from the NSSC-KSSL database, concluded that clay content alone does not appear to control SOC stabilization. Findings from this study where we focused on a large but relatively homogenous region, do suggest that clay content alone is still important to consider in representing SOC dynamics (Tables S2 and S3). In the topsoil (0-20 cm) and subsoil (40-70 cm), SOC was positively correlated with clay content (r = 0.28 and 0.23 for top and subsoil, respectively). However, a more nuanced picture emerges when looking at the correlations with the individual SOC fractions. Fitting with our conceptual descriptions of the formation and fate of these SOC fractions above, POC was not correlated with clay but MAOC showed a strong positive correlation for both depth increments (r = 0.43-0.44). Comparing the distribution of carbon into fractions with clay content also suggested a strong role for texture in controlling SOC dynamics (Fig. 3) with disproportionately more POC in sandy soils but MAOC and PyC dominate once clay content increases above about 8%. The role of clays and reactive mineral surfaces, generally, in controlling the stabilization of the MAOC fraction is well established (Kögel-Knabner et al. 2008;Kramer et al. 2012;Jastrow and Miller 2018). The PyC fraction was also correlated with clay content for both depths (r = 0.29-0.41); a finding that fits with the growing evidence that the PyC fraction is also retained via mineral association (Cusack et al. 2012;Soucémarianadin et al. 2014).

Land use patterns
The data in the NSSC-KSSL database indicate a strong impact of land use on topsoil SOC content. Soils collected from locations classified as cropland had significantly less SOC in the top 20 cm than from soils classified as grassland. When limiting the data to just Argiustolls, the most common Great Group (n = 650), there was 30% less SOC. This level of land use impact on SOC is consistent with other datasets (Guo and Gifford 2002;Sanderman et al. 2017) and is likely an underestimate of the true impact of cropping because most of the cropland sites are in the eastern wetter portions of the study region where we would expect higher SOC levels under native vegetation.
While there was relatively less C in all three fractions in cropping versus grassland sites, there was a relatively greater loss of POC than MAOC and even smaller loss of PyC (Fig. 4). These findings are consistent with observations from paired plots (Poeplau and Don 2013) and long term trials (Skjemstad et al. 2004) where the POC fraction was most sensitive to land use change. While the PyC fraction should be the least susceptible to microbially-mediated loss, our results suggest this fraction was still susceptible to loss over the approximately 100 year history of agriculture in this region. The most likely explanation for this apparent loss of this stable SOC fraction is the well documented history of accelerated erosion in the Great Plains (DeLuca and Zabinski 2011). Water and rill erosion will remove all three fractions more or less equally (Sanderman and Chappell 2013).
Integration with predictive soil mapping Using MIR-estimates of difficult to measure soil properties to increase data available to predictive soil mapping efforts is a valuable application of MIR technology (Hengl et al. 2015;Mirzaeitalarposhti et al. 2017;Viscarra Rossel and McBratney 2008). In this work, we have demonstrated how detailed measurements on 565 samples can be utilized to make estimates of these measurements on nearly 8000 samples using MIR spectroscopy. Estimates of carbon fractions for 7776 samples spread across 1482 geolocated soil profiles were then successfully predicted using a machine learning predictive soil mapping approach (Table 3) to arrive at 250 m resolution digital maps of carbon fractions for the Great Plains ecoregion (Fig. 5).
Several recent studies have suggested that the ratio of fast cycling to slower cycling carbon fractions can be used as an index of carbon vulnerability (Baldock et al. 2018;Viscarra Rossel et al. 2019;Gray et al. 2019). Here we have taken the data presented in Fig. 5 and calculated carbon vulnerability (V C ) as POC/ (MAOC ? PyC), with results presented in Fig. 6. In the Great Plains, we found a broad pattern of increasing vulnerability with increasing latitude (a good proxy for annual temperature). However, in the cooler and dry regions of the Great Plains, carbon vulnerability was also high. The region of sandy soils in Nebraska also stood out as a region of exceptionally vulnerable carbon. We found that vegetation greenness was weakly correlated with total stocks and distribution of fractions (Tables S2 and S3), a finding that contrasted with the work of Ahmed et al. (2017) who found greenness (as normalized difference vegetation index) was the single strongest predictor of PyC and POC stocks for the Wyoming, Colorado, Kansas and New Mexico region. Much of the difference between this present work and that of Ahmed et al. (2017) is likely the fact that we limited our study area to non-forest and non-shrub areas whereas the study region of Ahmed et al. (2017) included cropland, grassland, shrubland and forestland.
In a similar application of mapping MIR-estimated carbon fractions across the Australian continent, Viscarra Rossel et al. (2019) found that the hot semiarid and arid expanses that cover much of Australia had very low V C with higher vulnerability along the cooler and wetter southern and eastern rim of the continent. Mapping carbon fractions for just the state of New South Wales in Australia, Grey et al. (2019) found that lithology (siliceous versus mafic) was a strong driver of the relative abundance of fractions with soils derived from mafic parent material containing disproportionately less POC.
Maps, such as presented in Figs. 5 and 6, can form the basis for modeling efforts to better understand climate and land use feedbacks on soil carbon dynamics. For example, Lehmann et al. (2008) demonstrated that SOC model feedback to global warming would overestimate twenty-first century CO 2 emissions by about 20% if the PyC pool was ignored. More generally, having accurate estimates of the distribution of SOC into carbon fractions can greatly reduce model uncertainty (Lee and Viscarra Rossel 2020). In fact, due to the greater model confidence when initialized to measureable carbon fractions, soil carbon accounting in Australia's National Greenhouse Gas Inventory has been using a simulation model calibrated to these carbon fractions for over a decade (Skjemstad and Spouncer 2003;Skjemstad et al. 2004).

Conclusions
The dominant forms that SOC have accumulated in can matter as much as the total stock for predicting SOC response to perturbations. However, the labor and cost of applying a robust fractionation scheme severely limits the number of soil samples than can be fractionated particularly when you are interested in quantifying SOC that is pyrogenic in origin. In this work, we have extended and applied MIR spectroscopy-based predictive models for POC, MAOC and PyC fractions developed in a previous study to approximately 8000 samples from the Great Plains ecoregion of the United States. Finally, spatial covariates were able to capture much of the variance in the distribution of each fraction, allowing us to develop spatially continuous maps of these fractions for the Great Plains ecoregion. Interpretation of the SOC fraction database and derived spatial maps provided important insights into the vertical (i.e. soil profile) and horizontal (i.e. landscape scale) distribution of SOC fractions. These insights can be used to better parameterize carbon cycle models and the maps can form the basis for initializing soil carbon models to Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Data availability All carbon fraction measurements, predictions and associated covariate data used in analyzing spatial trends and generating maps are available at Harvard Dataverse (https://doi.org/10.7910/DVN/NW8K2P). Maps of soil properties can also be downloaded from the same repository. The USDA NSSC-KSSL soil spectral library is freely available by request from the USDA (contact rich.fergu-son@usda.gov). The Australia carbon fraction data and spectra can be requested from CSIRO (contact ryan.farquharson@csiro.au).