1 Introduction

1.1 Scientific problem

Terrestrial leached carbon (C) is commonly ignored in global C-cycle studies [1,2,3,4] leading to an inaccurate quantification of the terrestrial C budget [3, 5]. Meanwhile, several studies show that leaching and transport through groundwater of dissolved C is a major C source to surface waters [6,7,8,9]. These freshwaters are important active systems in the global C cycle, linking the land and the ocean [10]. Global inland waters are estimated to process between 1.9 and 5.1 Pg C year−1 [11], indicating the magnitude of the C flux into streams, rivers, lakes and reservoirs is fraught with uncertainties [11,12,13,14]. This C flux from global terrestrial systems comprises litterfall from vegetation in floodplains and riparian zones of primarily headwaters, surface runoff, leaching from soils and processing via groundwater to surface waters. Several of these pathways exporting C from the land commence in soils [15, 16]. The current study concerns C in this first part of the terrestrial-aquatic continuum: dissolved C in soil solution (soil pore water).

A global spatially distributed estimate of dissolved C in soil solution may aid in the development of process-based global models (e.g. [17,18,19]) to constrain the C flux from terrestrial systems into inland waters [11, 20, 21]. However currently, global maps of dissolved C in soil solution are yet lacking. In the light of this knowledge gap, we address the following research questions:

  • How are dissolved C concentrations in soils spatially distributed on a global scale?

  • How can we make a spatially explicit first estimate of dissolved C concentrations in soils on a global scale, in order to constrain large-scale C fluxes from soils?

Dissolved C in soil solution includes both organic (DOC) and inorganic C (DIC). In sampling studies, a range of factors has been identified as potential drivers of DOC and DIC concentrations in soil solutions. However landscape heterogeneity often implies that drivers at one location cannot be identified at another sampling plot. For example, while in several plot studies the soil carbon-to-nitrogen ratio (C/N) is found to relate positively to DOC concentrations [2, 22, 23], this is not recognised on the ecosystem level [24], while other studies identified even negative correlations [25, 26]. Inventories of dissolved C in soil solutions, which could provide insight into spatial patterns of dissolved C, generally focus only on a specific region or environment like grasslands [27], temperate forests [24, 28], specific forest types [29, 30] or other vegetation types [23, 31]. These studies therefore only represent a specific environment [32,33,34]. Similarly, modelling studies on dissolved C in soils focus on a specific site, region or specific environment [35,36,37,38,39,40], although a few models have been developed at the scale of large river basins, countries or parts of continents [17, 41,42,43]. [34] presented a first continent-wide data inventory analysis and regression model for DOC in soil solutions for mainly European forests [34, 44].

However, a global scale data inventory and analysis of DOC and DIC in soil solutions covering a wide range of vegetation types and climates is currently yet lacking, inhibiting identifying a set of drivers explaining dissolved C concentrations at that scale [33, 34, 45]. In this study we present the first global open-access database on annual average DOC and DIC concentrations and fluxes in soil solution (mainly DOC). The database covers different environmental conditions and includes a range of potential drivers. Using this database we developed global (multi-)regression models to calculate topsoil DOC concentrations, and DOC concentrations versus depth in the subsoil. Our models enable making a first estimate of spatially distributed dissolved DOC concentrations in soil solution at the global scale. In a follow-up study (Langeveld et al. in prep.), these models will be included in the Integrated Model to Assess the Global Environment (IMAGE) Dynamic Global Nutrient Model (DGNM) [19] to model the global DOC fluxes from soils and also constrain the C fluxes from the terrestrial to the aquatic system.

1.2 Introducing dissolved C biogeochemistry in soils

DOC and DIC may be removed from soils through leaching and surface runoff [15]. In the unsaturated (or vadose) zone, dissolved C in soil pore water between the surface and the saturated groundwater zone is considered, where DIC and DOC concentrations generally change with depth during transport to aquifers [24, 46, 47]. A range of factors has been identified in studies as potential drivers of DOC and DIC concentrations in soil solution, including hydrology, land cover or vegetation type, climate, temperature, terrestrial C fluxes, and soil class (Table 2). In addition, a range of chemical and physical soil properties has been identified as controls of C leaching, such as soil organic carbon-to-nitrogen (C/N) ratio, soil organic carbon (SOC) content, pH, and texture (Table 2).

DOC in soil solution can increase through direct leaching from a higher layer, decomposition of organic matter, or desorption from soil particles [48]. DOC concentrations are highest in the upper organic soil horizon, declining by 10–50% in the subsoil [31, 46]. In most soils, apart from removal due to transport, prime controls of DOC concentrations are biodegradation in the topsoil and adsorption and biodegradation in the subsoil [24, 48,49,50].

Dissolved CO\({_2}\), bicarbonate and carbonate together make up DIC. In the upper horizons or topsoil these carbon species are constrained by the local atmospheric CO\({_2}\) concentration and soil pH. DIC stems from weathering, CO\({_2}\) dissolution and respiration by plants, animals and microbes using DOC and particulate organic carbon (POC) [8, 47, 51]. DIC concentrations typically increase with depth due to organic C biodegradation. Oxygen strongly impacts biogeochemical processes like decomposition in the unsaturated zone [46, 52, 53].

2 Methods

The first part of the method Sect. 2.1 describes the construction of the database. The intermediate Sect. 2.2 addresses how some of the limitations of the database are approached while preparing the data for a global analysis and subsequent global model construction. This section necessarily already takes some results of the database into account. The final Sect. 2.3 addresses the statistical analysis and model construction. The database is not provided in the article but can be found in the Supplementary Information (SI), SI1.

2.1 Database construction

We collected published studies reporting measurements of dissolved C in soil solutions in the unsaturated zone, covering 762 entries from 351 sampling sites distributed over the main climate zones (Fig. 1 and Table 1). From these studies, all individual measurements (i.e., per location and depth) were included in the database, together with ancillary information, such as climate, soil and land cover. Studies were selected when reporting on DOC or DIC concentrations. Additionally, a few studies only reporting on DOC or DIC fluxes were added. When available, measurements on related forms such as biogenic DIC or alkalinity were also included. To be included in the database, data had to be easily available and shareable to the scientific community, so the database can stay open-source. For the database to be representative of large-scale spatial trends in dissolved C content, without seasonal or temporary effects [4, 31, 45, 55], we included only annual average values. These annual averages were included in the database as reported, or calculated from at least four measurements representing shorter time periods covering at least 1 year. A number of measurements were found not to fulfill the latter criterion and are thus reported under a different category as non-yearly averages (See Table 1). Similar to [24] we included both flux-weighted and straight average annual concentrations, as on a yearly scale the difference between both methods is limited [56]. All selected data are from measurements in the unsaturated zone unless specified differently, for example for peat soils (Histosols) with a high water table.

Fig. 1
figure 1

Sampling sites reported in the database, distributed over main Köppen–Geiger climate zones [54]

Table 1 Number of entries and sampling sites the in database for all carbon leaching variables, and their coverage over all main climate zones

Entries in the database for the same sampling site but for different depths have the same unique sampling ID. Thus, entries with the same sampling ID can be analysed for attenuation profiles over depth. All entries were initially reported as a topsoil if reported as such or, when unknown, when sampled shallower than 10 cm (when sampling depth available). Other soils were classified as subsoils. As in studies the classification of either top or subsoils was not consistent, an alternative classification following [57] was also included in the database, classifying all samples measured within 20 cm as topsoils and all samples from below 20 cm as subsoils. This second classification enables comparison of the data within consistent soil layer classes, and is therefore used in our analysis.

Information on general, environmental and soil characteristics, soil properties or data on the terrestrial C budget were included when available (see Table 2 for the available meta-data for DOC concentrations, or the database in SI 1). Based on the descriptions in the literature, we identified classifications for soil horizons, United States Department of Agriculture (USDA) soil class [58, 59], vegetation and, when available, soil texture (see SI 1). Several climate classifications were extracted from global climate maps representing the second half of the 20th century according to [54] SI 1 and Fig. 1).

DOC concentrations in the subsoil were calculated as relative concentrations compared to those in the topsoil. This approach allows to include data from laboratory experiments, since we assume biases inherent to incubation experiments are consistent in topsoil and subsoil. Relative concentrations were calculated for locations with data for various depths, provided that the shallowest depth is < 10 cm (with 10 cm taken as the median of the topsoil according to the classification following [57]).

In addition to the values reported in literature, data were extracted for 19 different parameters from International Soil Reference and Information Centre (ISRIC) soil data of the Harmonized World Soil Database (HWSD) [57, 60] for the corresponding grid cell coordinates of every sampling site in the database. We included these 30 second-resolution data (1°/120°), as well as different types of aggregations to 30 min (0.5°), in the topsoil analysis. Further, temperature and precipitation data [61] (30 min, long-term average + trend; data available from the climate database of the IMAGE model [62]) were extracted for every sampling site in the database for the corresponding grid cell. The database was compiled in Microsoft Excel, data were analysed with codes written in R or Python.

2.2 Limitations on a global scale

The distribution of data on potential drivers of DOC concentrations appeared unbalanced; the choice of factors included in sampling studies varies (Table 2). This is a problem earlier recognized on a smaller scale [48, 63]. As a result of these data gaps, for many factors, analysis is only possible on a limited part of the DOC concentration data. Even more, including several factors strongly reduces the amount of data involved (e.g. for topsoils, including pH, SOC and CN cuts the amount of data from 255 to 40), thereby impeding analysis at a global scale. Indeed, significant relations may be found for sub-sets of the data, but these relations only represent a certain limited environment. For example, for data exclusively from Spodosols (analysis including all soil types gave no significant results), a (relatively weak) relation can be found for C/N ratio versus DOC concentrations (SI 6). However, a model like this would only depend on a limited amount of data and is only valid for this soil type, based on data almost exclusively found in temperate forests. Such a model does not fit with the objective of our study. Data from soil databases may be an alternative; However, soil data from [60] (30 second-resolution data, as well as the dominant and mean for aggregated data to 30 min) yielded a poor correlation with observed topsoil DOC concentrations or other factors. For example for SOC, the Pearson correlation coefficient was not higher than 0.07 for the three extracted datasets.

We aim to constrain DOC concentrations on a global scale and thus not site-specific. A general aspect is that on such temporal scales integrated C cycle fluxes may be strongly related to ‘average’ biophysical conditions like in many other large-scale models such as Century [64] and LPJmL [65,66,67]. This makes empirical relationships between DOC concentrations and environmental conditions useful for bridging the gap between site and landscape scales. Thus, from the database we selected those factors with a global data coverage and therefore had to ignore some soil properties and environmental factors with potential predictability. Instead we use, beside climate and soil data extracted from grid, factors such as soil class, which often capture overall environmental conditions that determine soil forming factors and the resulting soil physical and chemical characteristics [4, 68], and land cover/vegetation [32, 37, 69,70,71], as a proxy for the soil C cycle, and climate zones representing temperature and moisture conditions [68, 72]. In other studies similar generalized controlling factors have been proposed, such as “biological activity” [49], “microbial metabolism” [73] or a “physico-chemically dominated control” [45]. In that respect a C model using ‘average’ biophysical conditions is consistent with other global C models that use biome type as a basis for modelling global soil respiration [74] or climatic life zones for estimating global soil organic carbon pools [75].

2.3 Statistical analysis and model construction

All numerical variables were plotted against each other and studied using simple descriptive methods (pearson R, spearman R, mean, median, standard deviation), for all classifications such as land cover type, top/sub soil or soil type. The effect of the type of sampling method is also analysed. Several transformations were examined for all C data (ln, log10, square root, square, reciprocal). DOC concentration data by far have the largest global coverage (Table 1). Therefore, only yearly DOC concentration data were selected for the construction of the global model. We used Box plots, QQ-plots and histograms to present and analyse frequency distributions and to identify outliers. Because most data are not normally distributed, the Mann–Whitney U test was used to compare differences between subgroups. We constructed different models to calculate the annual average DOC concentrations in topsoil and subsoil.

The topsoil model is constructed in a combined forward/backward multiple-linear regression analysis using the ‘step’ function in R [76], correcting with the Akaike information criterion (AIC) for the number of factors included. Covariate non-linearity was examined using partial residual plots (PRP), taking first the variables that show the strongest non-linearity, following [77]. Added variable plots (AVP) were used to identify outliers (isolated largest residuals or largest partial leverages) [78] (See SI 7). The model performance was expressed by the coefficient of determination (R\(^{2}\)) and the root mean square error (RMSE).

The subsoil model was constructed as a relative attenuation function with depth starting from the topsoil concentration. Codes for data processing, regression analysis and model construction were developed in R or Python.

Table 2 Meta-data overview, database selection of yearly DOC concentrations (550/762 entries)

3 Results and Discussion

This chapter contains four sections. In the first Sect. 3.1, the general results of the database are presented and summarized. The Sects. 3.2 and 3.3 contain the results and analysis for topsoils and subsoils. Both sections first address the database results for the respective soil layer (Sects. 3.2.1 and 3.3.1). Subsequently, the models for topsoil (Sect. 3.2.2) and subsoil (Sect. 3.3.2) are presented and discussed. The final section addresses the limitations and application of the models.

3.1 General aspects and data selection

The database contains mainly data for annual average DOC concentrations and fluxes (72% and 37% of the database entries), with only limited data for DIC (< 11%) (Table 1). DIC data generally show concentration/flux profiles as expected, increasing with depth [8, 47, 51, 106] due to less efficient exchange of CO\({_2}\) with the atmosphere [107, 108]. However, the DIC data are scant, which inhibits analysis and model development on a global scale. Therefore, we selected only DOC data for further analysis.

The number of DOC concentration data clearly exceed that of DOC flux data. More importantly, the global coverage of DOC concentration data is much larger as it covers all main climate zones, unlike the DOC flux data. This is of key importance, as we aim to assess dissolved C on a global scale. Therefore our further analysis focuses on DOC concentrations (Table 1).

We analysed the data according to the consistent topsoil–subsoil classification following [57]. Annual average DOC concentrations range from 1.7 to 88.3 (median = 24.43) mg C L−1 for topsoils and from 0.42 to 372.1 (median = 6.65) mg C L−1 for subsoils (excluding laboratory incubations). When excluding data from Histosols, median values are 25.27 and 5.50 mg C L−1 for topsoil and subsoil respectively. DOC concentrations decrease with depth from the organic rich topsoils down to the subsoils (Fig. 2). Earlier, smaller dataset compilations from forests by Dalva and Moore [46] and Michalzik et al. [24] show similar frequency distribution profiles and concentration ranges. Both the whole data set and horizon sub-sets are positively skewed distributed (Fig. 2 and SI 3), similar to the distribution of several soil properties in global datasets [57, 60].

Fig. 2
figure 2

DOC concentration (mg C L−1) data distributed over horizon classes (n = 465, 85% of all DOC concentration data, laboratory incubation data are excluded). Boxplot bars are medians. Whiskers extend up to 1.5 times the interquartile range (IQR) from the lower and upper quartile, unless exceeding the minimum or maximum. Individual data are jittered for visualization purposes. Five high non-lab-incubation values from peat samples (up to 372 mg C L−1) are outside the figure range, but included in the distribution

Fig. 3
figure 3

Top and subsoil DOC concentration (mg C L−1) data distributed per measuring method (n = 550). Boxplot bars are medians. Whiskers extend up to 1.5 times the interquartile range (IQR) from the lower and upper quartile, unless exceeding the minimum or maximum. Circles are values exceeding the whiskers. Two subsoil values (up to 372 mg C L−1) from Histosols are not shown, but included in the distribution. ’Passive device’ is a shorter name for the group ‘Passive bottle/well/tray/ditch/gauge/plate’

A range of different techniques has been used for measuring DOC concentrations. Laboratory experiments yield significantly higher concentrations in both topsoils (\(p< 0.001\), Fig. 3) and subsoils (\(p<0.001\)) than other methods, with five times higher median concentrations in topsoils. This can be attributed to soil sample disturbance and different conditions, that are often not representative for field conditions [32, 109,110,111,112,113]. In a review study, [48] conclude that relations identified in laboratory studies can often not be confirmed in the field, in particular for subsoils [48]. Therefore, we used data from the laboratory experiments only for analysing the relative concentration changes with depth in subsoils, for which we the use retained fraction compared to the topsoil.

In subsoils, piezometer-based concentrations are much higher than for all other sampling methods (\(p<0.001\), Fig. 3), as these samples are almost exclusively taken in Histosols, where DOC degradation is inhibited by high groundwater tables [85]. Where Zsolnay [114] and Buckingham et al. [56] found different concentrations between zero-tension and tension lysimeters [27], we did not observe this in neither top- (\(p>0.1\)) or subsoil data (\(p>0.1\)). Thus, both methods are combined in the analysis [30, 115].

Fig. 4
figure 4

Precipitation data from [61] versus DOC concentrations (mg C/L) (n = 533). Laboratory incubation data are excluded. Five DOC subsoil peat values (111.25–372.1) are not shown, but included in the regression. R\(^{2}\) values for all data, topsoil data and subsoil data are respectively 0.0325, 0.129 and 0.0073, only significant for all data and topsoil data (\(p<0.0001\))

For either top or subsoils, no single potential numerical driving factor showed a clear correlation with DOC concentrations in a simple (linear) regression (maximum \({\mathrm{R}}^{2}=0.13\); see for example Fig. 4). This suggests that DOC concentrations are not driven by one uniform first-order governing factor, but by a set of interrelated drivers and controls that are spatially variable [32, 91]. For example [23] showed for temperate forest soils that both DOC production and heterotrophic respiration (HR) increase with temperature and also soil C/N ratios [23, 116].

Fig. 5
figure 5

Top and subsoil DOC concentration (mg C L−1) data distributed over a USDA soil classes, b sub-climate zones according to [54] and c land cover groups. Laboratory incubation data are excluded, n = 533. Boxplot bars are medians. Whiskers extend up to 1.5 times the interquartile range (IQR) from the lower and upper quartile, unless exceeding the minimum or maximum. Circles are values exceeding the whiskers. Two subsoil values (up to 372 mg C L−1) from Histosols are not shown, but included in the distribution. In a, data for soils with a double or mixed classification is represented in both single classes. See SI 3 for similar distributions as for b, c, but excluding Histosols

Measured temperature and precipitation show a good correlation to corresponding climate data (R\(^{2}\) of 0.94 and 0.49). As the measured data are not available for all measurements, we used the temperature and precipitation data from [61] instead of the measured values in our further analysis.

Concentrations of DOC in top and subsoils differ by soil class, land cover type and climate zone (Fig. 5). In contrast to other soil classes, Histosols generally contain higher DOC concentrations in the subsoil compared to the topsoil [53, 85, 117]. This has a clear confounding effect on some of the classified data, e.g. for the land cover type ’grass agriculture’, which seem to have higher median concentrations in subsoils than topsoils (Fig. 5). However, when we exclude the Histosols, a more consistent pattern is shown (SI 3), with concentrations clearly decreasing with depth.

3.2 Topsoils

3.2.1 Database

Fig. 6
figure 6

Topsoil DOC concentration (mg C L−1) data distributed over horizon classes (n = 230, 90% of all topsoil DOC concentration data). Laboratory incubation data are excluded. Boxplot bars are medians. Whiskers extend up to 1.5 times the interquartile range (IQR) from the lower and upper quartile, unless exceeding the minimum or maximum. Individual data are jittered for visualization purposes

For the topsoil (classification following [57]), 255 entries on DOC concentrations are included in the database (excluding 10 laboratory incubations). The specific horizon within the topsoil being studied can impact the amount of DOC measured. [23] identifies that in a range of forest floor samples, 74% of the DOC is from the organic horizon, 12% from litter and 13% from the deeper roots, emphasizing the role of the upper layers in DOC production. For our topsoil data, O, A and O/A horizons are dominantly present, with lowest DOC concentrations in the A horizon (Fig. 6). Despite being often from saturated soils, values from Histosols do not differ significantly from the other topsoil data (\(p>0.1\)), so do not have to be filtered out in a further topsoil data analysis. Since B horizons are commonly subsoils we excluded them from the analysis of topsoil DOC concentrations.

All main climate zones are covered in the data, although there is a bias towards temperate and continental climates (Fig. 5), an issue recognized in earlier studies [32, 34]. Highest concentrations occur in humid continental climates and lowest in semi-arid climates. Also, oceanic climates have higher values than the three tropical climates (\(p<0.001\)). Generally, we observe higher median concentrations in more moderate climates, with some of the warmer regions having lower concentrations (Fig. 5 and SI 3). This could be explained by the lower decomposition rates in temperate zones, caused by sub-optimal conditions for microbial degradation [23, 49, 72, 73]. Further, a limited or absent litter layer in semi-arid, tundra, and savannah climates restricts the amount of organic C available for decomposition. Where some studies identified different DOC concentrations between coniferous and deciduous forests [70, 71], this was not observed at the global scale (\(p>0.1\)), despite the large number of data entries (Fig. 5), consistent with observations by Michalzik et al. [24]. We therefore aggregated the forest data into one main class (SI 1), as we can use climate zones to account for the climatic impact on the tropical and montane forest data.

3.2.2 Model

The topsoil multi-regression model was constructed involving the variables which are available for all DOC database entries. Therefore, factors like soil texture, C/N ratio or pH could not be included. On the basis of the data entries in the database (244 entries used for the topsoil model) and using the AIC to select the best fitting model, four factors are included in the model for calculating DOC concentration in the topsoil soil solutions:

$$\begin{aligned} DOC_{top}= & {} MAX(0.0, 1.623+coef_{CZ}+coef_{SC}+coef_{LU} -0.008207 P_{ann}) \end{aligned}$$
(1)

where \(DOC_{top}\) is the DOC concentration in topsoil soil solutions (mg C L−1), \(coef_{CZ}\), \(coef_{SC}\) and \(coef_{LU}\) are the coëfficents for respectively main climate zones, soil class and main land cover groups (Table 3), and \(P_{ann}\) is the annual average precipitation (mm year−1) [61]). 31% of the variation is explained by the model (RMSE = 14.9, RSE = 15.5, 242 degrees of freedom). Neither including sub-classes for climate and land cover, nor transforming factors to non-linear functions yielded better model performance. Including soil data extracted from soil databases in the multi-regression analysis did neither yield a higher model performance.

Table 3 Regression coefficients for the topsoil model in equation 1
Fig. 7
figure 7

Modelled topsoil DOC concentration (mg C L−1) versus observed values (n = 264), distributed over the five main climate zones [54]. \(\mathrm{R}^{2} = 0.36\), dotted line is 1:1 line. Entries with reported double/mixed USDA soil classes (n = 20) are duplicated to enable including soil classes in the regression analysis. Laboratory incubation and B horizon data are excluded (see text). Entry #220 is also excluded as identified as a high leverage point in a partial regression analysis using an added variable plot (AVP)

The scatter plot of measured versus calculated DOC concentrations (Fig. 7) shows that the model describes the main trend reasonable (\({\mathrm{R}}^{2} = 0.36\)). However, the model tends to overestimate lower concentrations for temperate and continental zones, while higher concentrations (> 50 mg C L−1) are consistently underestimated. The main reason is that a wide range of DOC concentrations occurs within a limited range of annual precipitation. Precipitation is negatively correlated to DOC concentrations (\(p<0.001\); Fig. 4), though the large residuals indicate considerable uncertainty. For a \(P_{ann}\) in the range of 700–1000 mm year−1 (mainly continental and temperate climates), DOC concentrations range from 1.7 to 87.6 mg C L−1 with high concentrations corresponding to the high values in continental and temperature climates shown in Fig. 7. The uncertainty may be due to the absence of information on the seasonality of precipitation and the multiple ways in which precipitation may influence concentrations, e.g. temporary accumulation in dry periods and flushing in wet seasons, dilution [4, 48, 119] or via soil moisture content [72, 116, 120]. Another cause of uncertainty is the poor spatial coverage of some governing factors, with the result that they could not be used for spatially dependent relationships. For example, C/N ratio may be an important factor in the case of Spodosols in temperate/continental climate zones (\({\mathrm{R}}^{2} = 0.36\)), consistent with several observations [2, 22,23,24, 89].

Fig. 8
figure 8

Modelled global topsoil DOC concentrations in soil solutions (mg C L−1) for the year 2000 with a 30 min resolution

Finally, we calculated the global distribution of topsoil DOC concentrations using grid data on land cover [62], precipitation [61], USDA soil classes from [59] and climate zones by Kottek et al. [54] (Fig. 8). For areas with Aridisols (soil class), ice cover (land use or soil class maps), rock cover, shifting sands, salt plains (soil class maps) or hot deserts (land use maps) we assume the DOC concentration to be equal to that in precipitation, using the median value from the overview by Aitkenheadsps-Peterson et al. [118] (Table 3). Model results yield topsoil concentrations up to 51 mg C L−1, with higher values generally in higher latitude continental zones with abundant forests. Lower concentrations occur in the arid zones with limited vegetation. DOC concentrations in equatorial regions, such as rainforests, are generally below ~25 mg C L−1. In an earlier overview [34] also found DOC concentrations in tropical topsoils to be significantly lower than those in boreal or temperate topsoils (see Fig. 1d [34]). This confirms that the effect of temperature on DOC is site-specific and possibly indirect [89], unlike results from some early forest studies [45, 80]. Probably, low tropical topsoil DOC concentrations are caused by optimal conditions for microbial degradation, reducing the amount of DOC before it leaches into the soil [23, 49, 73]. Topsoil concentrations in Histosols do not exceed those in mineral soils (Figs. 6 and 8), similar to earlier findings [34].

3.3 Subsoils

3.3.1 Database

For the subsoil, 285 entries on annual average DOC concentrations are included in the database (including data measured in laboratory incubations). Of the reported horizons (not present for all DOC data; Fig. 9), B horizons are dominant, with only few data for O/A horizons present. Except for the limited amount of O/A and C horizon values, the data in the horizon classes have similar distributions and medians. Unlike for topsoils, subsoil DOC concentrations in Histosols strongly differ from other subsoils. Low biodegradation rates due to anoxic conditions, caused by high water tables, as well as the high organic matter content in Histosols result in high DOC concentrations at larger depths [85, 117].

Fig. 9
figure 9

Subsoil DOC concentration (mg C L−1) data distributed over horizon classes (n = 234, 82% of all subsoil DOC concentration data). Laboratory incubation data are excluded (n  =7). Values for classes ‘O’ (n = 1) and ‘groundwater’ (n = 1). Two high peat values (372.1 and 222.5 mg C L−1) are not shown but included in the distribution. Boxplot bars are medians. Whiskers extend up to 1.5 times the interquartile range (IQR) from the lower and upper quartile, unless exceeding the minimum or maximum. Individual data are jittered for visualization purposes

When excluding Histosols, higher subsoil concentrations generally relate to higher topsoil DOC concentrations (Fig. 5), whereby the gradients are smaller for areas with low input due to low microbial activity, such as shrublands, tundra or permafrost (Gelisols) [23, 49, 72]. Soils under crops are a special case, with higher DOC concentrations in subsoils than in topsoils (Fig. 5). A possible explanation is DOC excretion by deeper plant roots [121]. However, data for soils under crops are scarce and may not represent actual conditions for croplands, similar to [83].

3.3.2 Model

DOC concentrations in subsoils are regulated through biodegradation and the balance of adsorption–resorption processes [24, 48, 50]. Evidence of a strong influence of specific driving factors is mainly derived from laboratory experiments and less evident than for topsoils [48]. As physical–chemical factors are a main control of subsoil DOC concentrations [31, 48, 50], soil class can therefore be used as a proxy for physical–chemical conditions [100]. In addition, soil classes also partly account for differences in hydrological flow paths [86]. Several field studies identified differences in subsoil DOC concentrations between USDA soil classes [4, 68, 85, 86].

We calculate DOC concentrations in the subsoil as relative concentrations compared to the topsoil. In other words, the relative concentration in the subsoil is the quotient of a concentration at depth x relative to the topsoil concentration. Assuming a constant attenuation with depth within a soil class, an exponential decay function can be fitted for each soil class. Figure 10 shows the example for Spodosols, with the corresponding coefficients for all classes in Table 4. In Histosols, DOC concentrations in subsoils can be both smaller or larger than those in topsoils (Table 4). With a known topsoil concentration for a corresponding soil class, the DOC subsoil concentration at depth x is calculated as follows:

$$\begin{aligned} DOC_{sub}= & {} DOC_{top} \ e^{(x \ coef_{SCsub})} \end{aligned}$$
(2)

where \(DOC_{sub}\) is the DOC concentration at depth x in the unsaturated zone, \(DOC_{top}\) is the topsoil concentration as measured or modelled by equation 1, and \(coef_{SCsub}\) is the soil class-dependent decay coefficient (Table 4).

Fig. 10
figure 10

Data and model for DOC concentrations with depth, relative to topsoil concentrations. Example for Spodosol soil class data (n = 81). \({\mathrm{R}}^{2} = 0.74\) (exponential regression). Two values above 1 (triangles) are excluded from the regression

Table 4 Regression coefficients for the subsoil model in equation 2, based on the nonlinear (weighted) least-squares estimates

For mineral soils, R\(^{2}\) values of the regressions range from 0.23 (Oxisols) to 0.74 (Alfisols and Spodols), up to 0.96 (Vertisols, but limited data) (Table 4). For some soil classes, in particular Vertisols, Gelisols and Andisols, the analysis is based on a limited dataset (Table 4) reflecting the small area covered by these soils (Vertisols 2%, Andisols 1%) or the permanently frozen state of the soil (Gelisols) [58, 59]. Except for Histosols and Oxisols, the functions describe patterns of the relative concentrations vs. depth quite well with no consistent overestimation nor underestimation for undeep soils. For deeper subsoils (> 1.0 m) the model seems to consistently underestimate DOC concentrations, for example due to subsoil biodegradation [48, 50], DOC production or exudation by plant roots [121] or DOC release due to desorption from soil colloids [27, 31, 99, 122]. In addition, the degradation of DOC could be overestimated by ignoring increasing recalcitrance of the organic C with depth [123].

Data and profiles for Oxisols and Histosols clearly differ from the other soil classes (SI 5) because in the strongly weathered Oxisols relatively high concentrations occur at greater depths due to high water percolation rates [86]. In Histosols, relative concentrations are observed to be both above (72% of the data) and below (28%) 1.0, with a very low R\(^{2}\) (0.01), depending on the depth of the water table (see footnote c in Table 4).

Fig. 11
figure 11

Modelled global subsoil DOC concentrations in soil solutions (mg C L−1) at a depth of one meter. Based on equation 2 using topsoil data from Fig. 8

Using Eq. 2, the modelled topsoil DOC concentrations (Fig. 8), k-values (Table 4) for USDA soil classes from [59] we calculated the global distribution off subsoil DOC concentrations at a depth of 1 m (Fig. 11). Modelled DOC concentrations at this depth range up to 266 mg C L−1, though values above 20 mg C L−1 only occur in areas with abundant Histosols. Highest subsoil concentrations in non-Histosols are between 5 and 10 mg C L−1 and mainly found in permafrost areas (Fig. 8).

The high concentrations in subsoils of Histosols are consistent with observations by Camino-Serrano et al. [34], who attributed the high DOC concentrations in subsoils to the low hydraulic conductivity in these soils. Histosol concentrations, in particular in subsoils, dominantly depend on the water table depth [53] and therefore the thickness of the oxic zone, and can thus both increase or decrease with depth [85, 117]. High concentrations in subsoil Gelisols are similar to observations of [124, 125] and caused by reduced DOC degradation due to low microbial activity in the permafrost [126].

Our subsoil model uses only a single factor, soil classes to estimate subsoil DOC on a global scale. Although it is of course only a simple model aimed to make a first estimate, one should note that other factors such as climate effects and vegetation are inherently included in the model. For example, Oxisols and Ultisols are typical subtropical and tropical soils [86], while Aridisols are typical for (semi)deserts with generally low vegetation coverage [59, 62]. Thus, the soil type not solely represents soil characteristics, but may indirectly also incorporate the effects of environmental and climatic conditions on DOC. As such, the subsoil model, like the topsoil model, may be viewed as a simplified but effective first step in estimating DOC concentrations on a global scale.

3.4 Limitations, application and perspective

Our global database on dissolved C in soil solutions does not yet include some other factors that have been identified in the literature as potential controls of DOC. For example, cation exchange capacity (CEC) [127], terrestrial acid deposition [128], anion deficit [97, 129] , soil specific surface area [122] or the composition of the DOC [52, 130]. Still, only few studies incorporated in our database include a broad range of parameters, measured in a similar way. To enable further in-depth analysis and to include additional process controls on a global scale, a standardized set of ancillary data and uniform sampling method is required. The ICP Forests program set up such a framework for monitoring in European forests [131], which enabled recent in-depth analysis and model construction on this sub-continental scale [e.g., 34, 44, 43, 128, 132].

Our model simulates DOC concentrations at the global scale, which is a major step forward compared to current, recent large-scale (country, large basin, sub-continental region) models [e.g. 17, 41, 42, 43]. Though the temporal scale is 1 year, temporal downscaling of the yearly average modelled DOC concentrations could be based on relative seasonal variability [e.g. 128].

This model will be included (Langeveld et al. in prep.) in the Integrated Model to Assess the Global Environment (IMAGE) Dynamic Global Nutrient Model (DGNM) [19]. When hydrology [e.g. 133, 134] is known, we can attempt to model the global DOC fluxes and also constrain the C fluxes from terrestrial to aquatic systems on a global scale.

4 Conclusions

We present the first global database on annual average DOC and DIC in soil solutions, covering all main climate zones. As data on DIC are scant, we conducted our analysis and model construction on annual average DOC concentrations. Highest topsoil DOC concentrations occur in forests in humid continental climates, while topsoils of Histosols do not have higher DOC concentrations than other soil classes. In contrast, highest concentrations in subsoils occur in Histosols. Our analysis shows that DOC concentrations are controlled by a complex of processes that vary in space. DOC concentrations from laboratory experiments are consistently higher than values found in the field.

We identified a set of four indirect controls of global topsoil DOC concentrations, i.e. precipitation, climate zones, vegetation types and soil classes. Further, our analysis showed that global subsoil DOC concentrations vs. depth can be modelled for all USDA soil classes. Here, soil class represents generalized physico-chemical properties that are not represented when only climate or land cover are used. Future sampling studies on DOC should be conducted in regions with land cover types currently underrepresented, such as crops, preferably over different soil classes. A standardized set of ancillary data and uniform sampling method would enable further constraining of global dissolved C concentrations in soil solutions.

5 Supplementary material and data availability

Supplementary material, including the database, is publicly available on PANGAEA (https://www.pangaea.de/).