Estimating dissolved carbon concentrations in global soils: a global database and model

Dissolved carbon (C) leaching in and from soils plays an important role in C transport along the terrestrial-aquatic continuum. However, a global overview and analysis of dissolved carbon in soil solutions, covering a wide range of vegetation types and climates, is lacking. We compiled a global database on annual average dissolved organic carbon (DOC) and dissolved inorganic carbon (DIC) in soil solutions, including potential governing factors, with 762 entries from 351 different sites covering a range of climate zones, land cover types and soil classes. Using this database we develop regression models to calculate topsoil concentrations, and concentrations versus depth in the subsoil at the global scale. For DIC, the lack of a proportional globally distributed cover inhibits analysis on a global scale. For DOC, annual average concentrations range from 1.7 to 88.3 (median = 25.27) mg C/L for topsoils (n = 255) and from 0.42 to 372.1 (median = 5.50) mg C/L for subsoils (n = 285, excluding lab incubations). Highest topsoil values occur in forests of cooler, humid zones. In topsoils, multiple regression showed that precipitation is the most significant factor. Our global topsoil DOC model (R2=0.36\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathrm{R}}^{2}=0.36$$\end{document}) uses precipitation, soil class, climate zone and land cover type as model factors. Our global subsoil model describes DOC concentrations vs. depth for different USDA soil classes (overall (R2=0.45\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathrm{R}}^{2}=0.45$$\end{document}). Highest subsoil DOC concentrations are calculated for Histosols.


Scientific problem
Terrestrial leached carbon (C) is commonly ignored in global C-cycle studies [1][2][3][4] leading to an inaccurate quantification of the terrestrial C budget [3,5]. Meanwhile, several studies show that leaching and transport through groundwater of dissolved C is a major C source to surface waters [6][7][8][9]. These freshwaters are important active systems in the global C cycle, linking the land and the ocean [10]. Global inland waters are estimated to process between 1.9 and 5.1 Pg C year −1 [11], indicating the magnitude of the C flux into streams, rivers, lakes and reservoirs is fraught with uncertainties [11][12][13][14]. This C flux from global terrestrial systems comprises litterfall from vegetation in floodplains and riparian zones of primarily headwaters, surface runoff, leaching from soils and processing via groundwater to surface waters. Several of these pathways exporting C from the land commence in soils [15,16]. The current study concerns C in this first part of the terrestrial-aquatic continuum: dissolved C in soil solution (soil pore water).
A global spatially distributed estimate of dissolved C in soil solution may aid in the development of process-based global models (e.g. [17][18][19]) to constrain the C flux from terrestrial systems into inland waters [11,20,21]. However currently, global maps of dissolved C in soil solution are yet lacking. In the light of this knowledge gap, we address the following research questions: • How are dissolved C concentrations in soils spatially distributed on a global scale? • How can we make a spatially explicit first estimate of dissolved C concentrations in soils on a global scale, in order to constrain large-scale C fluxes from soils?
Dissolved C in soil solution includes both organic (DOC) and inorganic C (DIC). In sampling studies, a range of factors has been identified as potential drivers of DOC and DIC concentrations in soil solutions. However landscape heterogeneity often implies that drivers at one location cannot be identified at another sampling plot. For example, while in several plot studies the soil carbon-tonitrogen ratio (C/N) is found to relate positively to DOC concentrations [2,22,23], this is not recognised on the ecosystem level [24], while other studies identified even negative correlations [25,26]. Inventories of dissolved C in soil solutions, which could provide insight into spatial patterns of dissolved C, generally focus only on a specific region or environment like grasslands [27], temperate forests [24,28], specific forest types [29,30] or other vegetation types [23,31]. These studies therefore only represent a specific environment [32][33][34]. Similarly, modelling studies on dissolved C in soils focus on a specific site, region or specific environment [35][36][37][38][39][40], although a few models have been developed at the scale of large river basins, countries or parts of continents [17,[41][42][43]. [34] presented a first continent-wide data inventory analysis and regression model for DOC in soil solutions for mainly European forests [34,44]. However, a global scale data inventory and analysis of DOC and DIC in soil solutions covering a wide range of vegetation types and climates is currently yet lacking, inhibiting identifying a set of drivers explaining dissolved C concentrations at that scale [33,34,45]. In this study we present the first global open-access database on annual average DOC and DIC concentrations and fluxes in soil solution (mainly DOC). The database covers different environmental conditions and includes a range of potential drivers. Using this database we developed global (multi-) regression models to calculate topsoil DOC concentrations, and DOC concentrations versus depth in the subsoil. Our models enable making a first estimate of spatially distributed dissolved DOC concentrations in soil solution at the global scale. In a follow-up study (Langeveld et al. in prep.), these models will be included in the Integrated Model to Assess the Global Environment (IMAGE) Dynamic Global Nutrient Model (DGNM) [19] to model the global DOC fluxes from soils and also constrain the C fluxes from the terrestrial to the aquatic system.

Introducing dissolved C biogeochemistry in soils
DOC and DIC may be removed from soils through leaching and surface runoff [15]. In the unsaturated (or vadose) zone, dissolved C in soil pore water between the surface and the saturated groundwater zone is considered, where DIC and DOC concentrations generally change with depth during transport to aquifers [24,46,47]. A range of factors has been identified in studies as potential drivers of DOC and DIC concentrations in soil solution, including hydrology, land cover or vegetation type, climate, temperature, terrestrial C fluxes, and soil class ( Table 2). In addition, a range of chemical and physical soil properties has been identified as controls of C leaching, such as soil organic carbon-to-nitrogen (C/N) ratio, soil organic carbon (SOC) content, pH, and texture (Table 2). DOC in soil solution can increase through direct leaching from a higher layer, decomposition of organic matter, or desorption from soil particles [48]. DOC concentrations are highest in the upper organic soil horizon, declining by 10-50% in the subsoil [31,46]. In most soils, apart from removal due to transport, prime controls of DOC concentrations are biodegradation in the topsoil and adsorption and biodegradation in the subsoil [24,[48][49][50].
Dissolved CO 2 , bicarbonate and carbonate together make up DIC. In the upper horizons or topsoil these carbon species are constrained by the local atmospheric CO 2 concentration and soil pH. DIC stems from weathering, CO 2 dissolution and respiration by plants, animals and microbes using DOC and particulate organic carbon (POC) [8,47,51]. DIC concentrations typically increase with depth due to organic C biodegradation. Oxygen strongly impacts biogeochemical processes like decomposition in the unsaturated zone [46,52,53]. account. The final Sect. 2.3 addresses the statistical analysis and model construction. The database is not provided in the article but can be found in the Supplementary Information (SI), SI1.

Database construction
We collected published studies reporting measurements of dissolved C in soil solutions in the unsaturated zone, covering 762 entries from 351 sampling sites distributed over the main climate zones ( Fig. 1 and Table 1). From these studies, all individual measurements (i.e., per location and depth) were included in the database, together with ancillary information, such as climate, soil and land cover. Studies were selected when reporting on DOC or DIC concentrations. Additionally, a few studies only reporting on DOC or DIC fluxes were added. When available, measurements on related forms such as biogenic DIC or alkalinity were also included. To be included in the database, data had to be easily available and shareable to the scientific community, so the database can stay opensource. For the database to be representative of large-scale   spatial trends in dissolved C content, without seasonal or temporary effects [4,31,45,55], we included only annual average values. These annual averages were included in the database as reported, or calculated from at least four measurements representing shorter time periods covering at least 1 year. A number of measurements were found not to fulfill the latter criterion and are thus reported under a different category as non-yearly averages (See Table 1). Similar to [24] we included both flux-weighted and straight average annual concentrations, as on a yearly scale the difference between both methods is limited [56]. All selected data are from measurements in the unsaturated zone unless specified differently, for example for peat soils (Histosols) with a high water table. Entries in the database for the same sampling site but for different depths have the same unique sampling ID. Thus, entries with the same sampling ID can be analysed for attenuation profiles over depth. All entries were initially reported as a topsoil if reported as such or, when unknown, when sampled shallower than 10 cm (when sampling depth available). Other soils were classified as subsoils. As in studies the classification of either top or subsoils was not consistent, an alternative classification following [57] was also included in the database, classifying all samples measured within 20 cm as topsoils and all samples from below 20 cm as subsoils. This second classification enables comparison of the data within consistent soil layer classes, and is therefore used in our analysis.
Information on general, environmental and soil characteristics, soil properties or data on the terrestrial C budget were included when available (see Table 2 for the available meta-data for DOC concentrations, or the database in SI 1). Based on the descriptions in the literature, we identified classifications for soil horizons, United States Department of Agriculture (USDA) soil class [58,59], vegetation and, when available, soil texture (see SI 1). Several climate classifications were extracted from global climate maps representing the second half of the 20th century according to [54] SI 1 and Fig. 1).
DOC concentrations in the subsoil were calculated as relative concentrations compared to those in the topsoil. This approach allows to include data from laboratory  NPP net primary production, NEP net ecosystem production, NEE net ecosystem exchange, NBP or NECB net biome production or net ecosystem carbon balance, without leaching. See also SI 1 a Location coordinates as reported in the study, or obtained from using the location description in the study b See classification tables in SI c Where measured at one sampling site at different depths, every observations is a single entry with the same sampling ID d Measured on the site or from weather station data e Sampling depth is reported when available. Shallow or surface level observations with actual depth not reported, while sampling depths are given for other soil layers, are assumed to be 2.5 cm f For 3 entries C/N ratio in soil solutions is used instead g Studies report SOC using three different units. When converting from kg m −2 to % we assumed a layer thickness of 10 cm and a fixed bulk density of 1.0 (topsoil) and 1.3 (subsoil), except for Andisols or Histosols (0.7 kg dm −3 ) h Soil pH measured in wet soil. Some measurements in CaCl 2 were transformed using formula 1 by [105]. A number of studies reports pH measured in solution instead of in soil (14% all pH data). As these are strongly related [85,95], they are included when no soil pH is available i Mentioned studies sometimes consider relations in only top-or a subsoil j Though not a driver, several studies discuss the impact of the sampling method experiments, since we assume biases inherent to incubation experiments are consistent in topsoil and subsoil. Relative concentrations were calculated for locations with data for various depths, provided that the shallowest depth is < 10 cm (with 10 cm taken as the median of the topsoil according to the classification following [57]). In addition to the values reported in literature, data were extracted for 19 different parameters from International Soil Reference and Information Centre (ISRIC) soil data of the Harmonized World Soil Database (HWSD) [57,60] for the corresponding grid cell coordinates of every sampling site in the database. We included these 30 second-resolution data (1°/120°), as well as different types of aggregations to 30 min (0.5°), in the topsoil analysis. Further, temperature and precipitation data [61] (30 min, long-term average + trend; data available from the climate database of the IMAGE model [62]) were extracted for every sampling site in the database for the corresponding grid cell. The database was compiled in Microsoft Excel, data were analysed with codes written in R or Python.

Limitations on a global scale
The distribution of data on potential drivers of DOC concentrations appeared unbalanced; the choice of factors included in sampling studies varies ( Table 2). This is a problem earlier recognized on a smaller scale [48,63]. As a result of these data gaps, for many factors, analysis is only possible on a limited part of the DOC concentration data. Even more, including several factors strongly reduces the amount of data involved (e.g. for topsoils, including pH, SOC and CN cuts the amount of data from 255 to 40), thereby impeding analysis at a global scale. Indeed, significant relations may be found for sub-sets of the data, but these relations only represent a certain limited environment. For example, for data exclusively from Spodosols (analysis including all soil types gave no significant results), a (relatively weak) relation can be found for C/N ratio versus DOC concentrations (SI 6). However, a model like this would only depend on a limited amount of data and is only valid for this soil type, based on data almost exclusively found in temperate forests. Such a model does not fit with the objective of our study. Data from soil databases may be an alternative; However, soil data from [60] (30 second-resolution data, as well as the dominant and mean for aggregated data to 30 min) yielded a poor correlation with observed topsoil DOC concentrations or other factors. For example for SOC, the Pearson correlation coefficient was not higher than 0.07 for the three extracted datasets.
We aim to constrain DOC concentrations on a global scale and thus not site-specific. A general aspect is that on such temporal scales integrated C cycle fluxes may be strongly related to 'average' biophysical conditions like in many other large-scale models such as Century [64] and LPJmL [65][66][67]. This makes empirical relationships between DOC concentrations and environmental conditions useful for bridging the gap between site and landscape scales. Thus, from the database we selected those factors with a global data coverage and therefore had to ignore some soil properties and environmental factors with potential predictability. Instead we use, beside climate and soil data extracted from grid, factors such as soil class, which often capture overall environmental conditions that determine soil forming factors and the resulting soil physical and chemical characteristics [4,68], and land cover/vegetation [32,37,[69][70][71], as a proxy for the soil C cycle, and climate zones representing temperature and moisture conditions [68,72]. In other studies similar generalized controlling factors have been proposed, such as "biological activity" [49], "microbial metabolism" [73] or a "physico-chemically dominated control" [45]. In that respect a C model using 'average' biophysical conditions is consistent with other global C models that use biome type as a basis for modelling global soil respiration [74] or climatic life zones for estimating global soil organic carbon pools [75].

Statistical analysis and model construction
All numerical variables were plotted against each other and studied using simple descriptive methods (pearson R, spearman R, mean, median, standard deviation), for all classifications such as land cover type, top/sub soil or soil type. The effect of the type of sampling method is also analysed. Several transformations were examined for all C data (ln, log10, square root, square, reciprocal). DOC concentration data by far have the largest global coverage (Table 1). Therefore, only yearly DOC concentration data were selected for the construction of the global model. We used Box plots, QQ-plots and histograms to present and analyse frequency distributions and to identify outliers. Because most data are not normally distributed, the Mann-Whitney U test was used to compare differences between subgroups. We constructed different models to calculate the annual average DOC concentrations in topsoil and subsoil.
The topsoil model is constructed in a combined forward/backward multiple-linear regression analysis using the 'step' function in R [76], correcting with the Akaike information criterion (AIC) for the number of factors included. Covariate non-linearity was examined using partial residual plots (PRP), taking first the variables that show the strongest non-linearity, following [77]. Added variable plots (AVP) were used to identify outliers (isolated largest residuals or largest partial leverages) [78] (See SI 7). The model performance was expressed by the coefficient of determination (R 2 ) and the root mean square error (RMSE).
The subsoil model was constructed as a relative attenuation function with depth starting from the topsoil concentration. Codes for data processing, regression analysis and model construction were developed in R or Python.

Results and Discussion
This chapter contains four sections. In the first Sect

General aspects and data selection
The database contains mainly data for annual average DOC concentrations and fluxes (72% and 37% of the database entries), with only limited data for DIC (< 11%) ( Table 1). DIC data generally show concentration/flux profiles as expected, increasing with depth [8,47,51,106] due to less efficient exchange of CO 2 with the atmosphere [107,108]. However, the DIC data are scant, which inhibits analysis and model development on a global scale. Therefore, we selected only DOC data for further analysis.
The number of DOC concentration data clearly exceed that of DOC flux data. More importantly, the global coverage of DOC concentration data is much larger as it covers all main climate zones, unlike the DOC flux data. This is of key importance, as we aim to assess dissolved C on a global scale. Therefore our further analysis focuses on DOC concentrations (Table 1).
We analysed the data according to the consistent topsoil-subsoil classification following [57]. Annual average DOC concentrations range from 1.7 to 88.3 (median = 24.43) mg C L −1 for topsoils and from 0.42 to 372.1 (median = 6.65) mg C L −1 for subsoils (excluding laboratory incubations). When excluding data from Histosols, median values are 25.27 and 5.50 mg C L −1 for topsoil and subsoil respectively. DOC concentrations decrease with depth from the organic rich topsoils down to the subsoils (Fig. 2). Earlier, smaller dataset compilations from forests by Dalva and Moore [46] and Michalzik et al. [24] show similar frequency distribution profiles and concentration ranges. Both the whole data set and horizon sub-sets are positively skewed distributed ( Fig. 2 and SI 3), similar to  [57,60].
A range of different techniques has been used for measuring DOC concentrations. Laboratory experiments yield significantly higher concentrations in both topsoils ( p < 0.001 , Fig. 3) and subsoils ( p < 0.001 ) than other methods, with five times higher median concentrations in topsoils. This can be attributed to soil sample disturbance and different conditions, that are often not representative for field conditions [32,[109][110][111][112][113]. In a review study, [48] conclude that relations identified in laboratory studies can often not be confirmed in the field, in particular for subsoils [48]. Therefore, we used data from the laboratory experiments only for analysing the relative concentration changes with depth in subsoils, for which we the use retained fraction compared to the topsoil.
In subsoils, piezometer-based concentrations are much higher than for all other sampling methods ( p < 0.001 , Fig. 3), as these samples are almost exclusively taken in Histosols, where DOC degradation is inhibited by high groundwater tables [85]. Where Zsolnay [114] and Buckingham et al. [56] found different concentrations between zero-tension and tension lysimeters [27], we did not observe this in neither top-( p > 0.1 ) or subsoil data ( p > 0.1 ). Thus, both methods are combined in the analysis [30,115].
For either top or subsoils, no single potential numerical driving factor showed a clear correlation with DOC concentrations in a simple (linear) regression (maximum R 2 = 0.13 ; see for example Fig. 4). This suggests that DOC concentrations are not driven by one uniform first-order governing factor, but by a set of interrelated drivers and controls that are spatially variable [32,91]. For example [23] showed for temperate forest soils that both DOC production and heterotrophic respiration (HR) increase with temperature and also soil C/N ratios [23,116].
Measured temperature and precipitation show a good correlation to corresponding climate data (R 2 of 0.94 and 0.49). As the measured data are not available for all measurements, we used the temperature and precipitation data from [61] instead of the measured values in our further analysis.
Concentrations of DOC in top and subsoils differ by soil class, land cover type and climate zone (Fig. 5). In contrast to other soil classes, Histosols generally contain higher DOC concentrations in the subsoil compared to the topsoil [53,85,117]. This has a clear confounding effect on some of the classified data, e.g. for the land cover type 'grass agriculture' , which seem to have higher median concentrations in subsoils than topsoils (Fig. 5). However, when we exclude the Histosols, a more consistent pattern is shown (SI 3), with concentrations clearly decreasing with depth.

Database
For the topsoil (classification following [57]), 255 entries on DOC concentrations are included in the database (excluding 10 laboratory incubations). The specific horizon within the topsoil being studied can impact the amount of DOC measured. [23] identifies that in a range of forest floor samples, 74% of the DOC is from the organic horizon, 12% from litter and 13% from the deeper roots, emphasizing the role of the upper layers in DOC production. For our topsoil data, O, A and O/A horizons are dominantly present, with lowest DOC concentrations in the A horizon (Fig. 6). Despite being often from saturated soils, values from Histosols do not differ significantly from the other topsoil data ( p > 0.1 ), so do not have to be filtered out in a further topsoil data analysis. Since B horizons are commonly subsoils we excluded them from the analysis of topsoil DOC concentrations.
All main climate zones are covered in the data, although there is a bias towards temperate and continental climates (Fig. 5), an issue recognized in earlier studies [32,34]. Highest concentrations occur in humid continental climates and lowest in semi-arid climates. Also, oceanic climates have higher values than the three tropical climates ( p < 0.001 ). Generally, we observe higher median concentrations in more moderate climates, with some of the warmer regions having lower concentrations ( Fig. 5 and SI 3). This could be explained by the lower decomposition rates in temperate zones, caused by sub-optimal conditions for microbial degradation [23,49,72,73]. Further, a limited or absent litter layer in semi-arid, tundra, and savannah climates restricts the amount of organic C available for decomposition. Where some studies identified different DOC concentrations between coniferous and deciduous forests [70,71], this was not observed at the global scale ( p > 0.1 ), despite the large number of data entries (Fig. 5), consistent with observations by Michalzik et al. [24]. We therefore aggregated the forest data into one main class (SI 1), as we can use climate zones to account for the climatic impact on the tropical and montane forest data.

Model
The topsoil multi-regression model was constructed involving the variables which are available for all DOC database entries. Therefore, factors like soil texture, C/N ratio or pH could not be included. On the basis of the data entries in the database (244 entries used for the topsoil model) and using the AIC to select the best fitting model, four factors are included in the model for calculating DOC concentration in the topsoil soil solutions: where DOC top is the DOC concentration in topsoil soil solutions (mg C L −1 ), coef CZ , coef SC and coef LU are the coëfficents for respectively main climate zones, soil class and main land cover groups (Table 3), and P ann is the annual average precipitation (mm year −1 ) [61]). 31% of the variation is explained by the model (RMSE = 14.9, RSE = 15.5, 242 degrees of freedom). Neither including sub-classes for climate and land cover, nor transforming factors to non-linear functions yielded better model performance. Including soil data extracted from soil databases in the multi-regression analysis did neither yield a higher model performance.

Fig. 5
Top and subsoil DOC concentration (mg C L −1 ) data distributed over a USDA soil classes, b sub-climate zones according to [54] and c land cover groups. Laboratory incubation data are excluded, n = 533. Boxplot bars are medians. Whiskers extend up to 1.5 times the interquartile range (IQR) from the lower and upper quartile, unless exceeding the minimum or maximum.
Circles are values exceeding the whiskers. Two subsoil values (up to 372 mg C L −1 ) from Histosols are not shown, but included in the distribution. In a, data for soils with a double or mixed classification is represented in both single classes. See SI 3 for similar distributions as for b, c, but excluding Histosols model tends to overestimate lower concentrations for temperate and continental zones, while higher concentrations (> 50 mg C L −1 ) are consistently underestimated. The main reason is that a wide range of DOC concentrations occurs within a limited range of annual precipitation. Precipitation is negatively correlated to DOC concentrations ( p < 0.001 ; Fig. 4), though the large residuals indicate considerable uncertainty. For a P ann in the range of 700-1000 mm year −1 (mainly continental and temperate climates), DOC concentrations range from 1.7 to 87.6 mg C L −1 with high concentrations corresponding to the high values in continental and temperature climates shown in Fig. 7. The uncertainty may be due to the absence of information on the seasonality of precipitation and the multiple ways in which precipitation may influence concentrations, e.g. temporary accumulation Fig. 6 Topsoil DOC concentration (mg C L −1 ) data distributed over horizon classes (n = 230, 90% of all topsoil DOC concentration data). Laboratory incubation data are excluded. Boxplot bars are medians. Whiskers extend up to 1.5 times the interquartile range (IQR) from the lower and upper quartile, unless exceeding the minimum or maximum. Individual data are jittered for visualization purposes Table 3 Regression coefficients for the topsoil model in equation 1 a Main Köppen-Geiger climate zones [54] b Aridisols are not included in the database. As these soils occur in arid to semi-arid regions with sparse vegetation and very low microbial activity, we assume the DOC concentration to be equal to that in precipitation. For the concentration in precipitation, we calculated the median value from the overview by Aitkenheadsps-Peterson et al. [118]; 1.55 mg C L −1 . For comparison, this is up to 50% lower than the database DOC values, available for a steppe climate, but in oxisols. A fixed value of 1.55 mg C L −1 is also used for ice cover (land use or soil class maps), rock cover, shifting sands, salt plains (soil class maps) or hot deserts (land use maps) c For some DOC data, only classification based on groundwater behavior (non-USDA) was possible in dry periods and flushing in wet seasons, dilution [4,48,119] or via soil moisture content [72,116,120]. Another cause of uncertainty is the poor spatial coverage of some governing factors, with the result that they could not be used for spatially dependent relationships. For example, C/N ratio may be an important factor in the case of Spodosols in temperate/continental climate zones ( R 2 = 0.36 ), consistent with several observations [2,[22][23][24]89].
Finally, we calculated the global distribution of topsoil DOC concentrations using grid data on land cover [62], precipitation [61], USDA soil classes from [59] and climate zones by Kottek et al. [54] (Fig. 8). For areas with Aridisols (soil class), ice cover (land use or soil class maps), rock cover, shifting sands, salt plains (soil class maps) or hot deserts (land use maps) we assume the DOC concentration to be equal to that in precipitation, using the median value from the overview by Aitkenheadsps-Peterson et al. [118] (Table 3). Model results yield topsoil concentrations up to 51 mg C L −1 , with higher values generally in higher latitude continental zones with abundant forests. Lower concentrations occur in the arid zones with limited vegetation. DOC concentrations in equatorial regions, such as rainforests, are generally below ~25 mg C L −1 . In an earlier overview [34] also found DOC concentrations in tropical topsoils to be significantly lower than those in boreal or temperate topsoils (see Fig. 1d [34]). This confirms that the effect of temperature on DOC is site-specific and possibly indirect [89], unlike results from some early forest studies [45,80]. Probably, low tropical topsoil DOC concentrations are caused by optimal conditions for microbial degradation, reducing the amount of DOC before it leaches into the soil [23,49,73]. Topsoil concentrations in Histosols do not exceed those in mineral soils (Figs. 6 and 8), similar to earlier findings [34]. . Entry #220 is also excluded as identified as a high leverage point in a partial regression analysis using an added variable plot (AVP)

Database
For the subsoil, 285 entries on annual average DOC concentrations are included in the database (including data measured in laboratory incubations). Of the reported horizons (not present for all DOC data; Fig. 9), B horizons are dominant, with only few data for O/A horizons present. Except for the limited amount of O/A and C horizon values, the data in the horizon classes have similar distributions and medians. Unlike for topsoils, subsoil DOC concentrations in Histosols strongly differ from other subsoils. Low biodegradation rates due to anoxic conditions, caused by high water tables, as well as the high organic matter content in Histosols result in high DOC concentrations at larger depths [85,117].
When excluding Histosols, higher subsoil concentrations generally relate to higher topsoil DOC concentrations (Fig. 5), whereby the gradients are smaller for areas with low input due to low microbial activity, such as shrublands, tundra or permafrost (Gelisols) [23,49,72]. Soils under crops are a special case, with higher DOC concentrations in subsoils than in topsoils (Fig. 5). A possible explanation is DOC excretion by deeper plant roots [121]. However, data for soils under crops are scarce and may not represent actual conditions for croplands, similar to [83].

Model
DOC concentrations in subsoils are regulated through biodegradation and the balance of adsorption-resorption processes [24,48,50]. Evidence of a strong influence of specific driving factors is mainly derived from laboratory experiments and less evident than for topsoils [48]. As physical-chemical factors are a main control of subsoil DOC concentrations [31,48,50], soil class can therefore be used as a proxy for physical-chemical conditions [100]. In addition, soil classes also partly account for differences in hydrological flow paths [86]. Several field studies identified differences in subsoil DOC concentrations between USDA soil classes [4,68,85,86].
We calculate DOC concentrations in the subsoil as relative concentrations compared to the topsoil. In other words, the relative concentration in the subsoil is the quotient of a concentration at depth x relative to the topsoil concentration. Assuming a constant attenuation with depth within a soil class, an exponential decay function can be fitted for each soil class. Figure 10 shows the example for Spodosols, with the corresponding coefficients for all classes in Table 4. In  (Table 4). With a known topsoil concentration for a corresponding soil class, the DOC subsoil concentration at depth x is calculated as follows: where DOC sub is the DOC concentration at depth x in the unsaturated zone, DOC top is the topsoil concentration as measured or modelled by equation 1, and coef SCsub is the soil class-dependent decay coefficient (Table 4).
For mineral soils, R 2 values of the regressions range from 0.23 (Oxisols) to 0.74 (Alfisols and Spodols), up to 0.96 (Vertisols, but limited data) ( Table 4). For some soil classes, in particular Vertisols, Gelisols and Andisols, the analysis is based on a limited dataset (Table 4) reflecting the small area covered by these soils (Vertisols 2%, Andisols 1%) or the permanently frozen state of the soil (Gelisols) [58,59]. Except for Histosols and Oxisols, the functions describe patterns of the relative concentrations vs. depth quite well with no consistent overestimation nor underestimation for undeep soils. For deeper subsoils (> 1.0 m) the model seems to consistently underestimate DOC concentrations, for example due to subsoil biodegradation [48,50], DOC production or exudation by plant roots [121] or DOC release due to desorption from soil colloids [27,31,99,122]. In addition, the degradation of DOC could be overestimated by ignoring increasing recalcitrance of the organic C with depth [123].
Data and profiles for Oxisols and Histosols clearly differ from the other soil classes (SI 5) because in the strongly weathered Oxisols relatively high concentrations occur at greater depths due to high water percolation rates [86]. In Histosols, relative concentrations are observed to be both above (72% of the data) and below (28%) 1.0, with a very low R 2 (0.01), depending on the depth of the water table (see footnote c in Table 4).
Using Eq. 2, the modelled topsoil DOC concentrations (Fig. 8), k-values (Table 4) for USDA soil classes from [59] we calculated the global distribution off subsoil DOC concentrations at a depth of 1 m (Fig. 11). Modelled DOC concentrations at this depth range up to 266 mg C L −1 , though values above 20 mg C L −1 only occur in areas with abundant Histosols. Highest subsoil concentrations in non-Histosols are between 5 and 10 mg C L −1 and mainly found in permafrost areas (Fig. 8).
The high concentrations in subsoils of Histosols are consistent with observations by Camino-Serrano et al. [34], who attributed the high DOC concentrations in subsoils to the low hydraulic conductivity in these soils. Histosol concentrations, in particular in subsoils, dominantly depend on the water table depth [53] and therefore the thickness of the oxic zone, and can thus both increase or decrease with depth [85,117]. High concentrations in subsoil Gelisols are similar to observations of [124,125] and caused by reduced DOC degradation due to low microbial activity in the permafrost [126].
Our subsoil model uses only a single factor, soil classes to estimate subsoil DOC on a global scale. Although it is of course only a simple model aimed to make a first estimate, one should note that other factors such as climate effects and vegetation are inherently included in the model. For example, Oxisols and Ultisols are typical subtropical and tropical soils [86], while Aridisols are typical for (semi) deserts with generally low vegetation coverage [59,62]. Thus, the soil type not solely represents soil characteristics, Table 4 Regression coefficients for the subsoil model in equation 2, based on the nonlinear (weighted) least-squares estimates a n= amount of database values of this class for which a relative reduction can be calculated. RMSE = root-mean-square error, measure of the differences between values predicted and observed. p < 0.01 , except for Gelisols (< 0.1) and Histosols (see c ) b Average coefficient of regression on the data, excluding values in peat. Mixed soil classes including a Histosol are included c Out of all subsoil Histosol data for which a relative quotient with depth can be calculated, 28% has a relative reduction below 1.0. We therefore conducted the regression over all data, resulting in a very low positive coefficient, causing a quasi-linear positive relation. The p value is quite high (0.11) due to the large spread of both positive and negative data d Insufficient subsoil data in Mollisols is available in the database; the average coefficient without Histosols is therefore used. n.d. means no data in the database available e The USDA soil class map contains also the classes salt, shifting sand, rock and ice. Rock and ice are assumed to have no subsoil (NaN). For shifting sand and salt, the average coefficient without Histosols is used. n.d. means no data in the database available USDA soil class k-value n a RMSE a R b but may indirectly also incorporate the effects of environmental and climatic conditions on DOC. As such, the subsoil model, like the topsoil model, may be viewed as a simplified but effective first step in estimating DOC concentrations on a global scale.

Limitations, application and perspective
Our global database on dissolved C in soil solutions does not yet include some other factors that have been identified in the literature as potential controls of DOC. For example, cation exchange capacity (CEC) [127], terrestrial acid deposition [128], anion deficit [97,129] , soil specific surface area [122] or the composition of the DOC [52,130]. Still, only few studies incorporated in our database include a broad range of parameters, measured in a similar way. To enable further in-depth analysis and to include additional process controls on a global scale, a standardized set of ancillary data and uniform sampling method is required. The ICP Forests program set up such a framework for monitoring in European forests [131], which enabled recent in-depth analysis and model construction on this sub-continental scale [e.g., 34,44,43,128,132].
Our model simulates DOC concentrations at the global scale, which is a major step forward compared to current, recent large-scale (country, large basin, sub-continental region) models [e.g. 17,41,42,43]. Though the temporal scale is 1 year, temporal downscaling of the yearly average modelled DOC concentrations could be based on relative seasonal variability [e.g. 128].
This model will be included (Langeveld et al. in prep.) in the Integrated Model to Assess the Global Environment (IMAGE) Dynamic Global Nutrient Model (DGNM) [19]. When hydrology [e.g. 133,134] is known, we can attempt to model the global DOC fluxes and also constrain the C fluxes from terrestrial to aquatic systems on a global scale.

Conclusions
We present the first global database on annual average DOC and DIC in soil solutions, covering all main climate zones. As data on DIC are scant, we conducted our analysis and model construction on annual average DOC concentrations. Highest topsoil DOC concentrations occur in forests in humid continental climates, while topsoils of Histosols do not have higher DOC concentrations than other Fig. 11 Modelled global subsoil DOC concentrations in soil solutions (mg C L −1 ) at a depth of one meter. Based on equation 2 using topsoil data from Fig. 8 soil classes. In contrast, highest concentrations in subsoils occur in Histosols. Our analysis shows that DOC concentrations are controlled by a complex of processes that vary in space. DOC concentrations from laboratory experiments are consistently higher than values found in the field.
We identified a set of four indirect controls of global topsoil DOC concentrations, i.e. precipitation, climate zones, vegetation types and soil classes. Further, our analysis showed that global subsoil DOC concentrations vs. depth can be modelled for all USDA soil classes. Here, soil class represents generalized physico-chemical properties that are not represented when only climate or land cover are used. Future sampling studies on DOC should be conducted in regions with land cover types currently underrepresented, such as crops, preferably over different soil classes. A standardized set of ancillary data and uniform sampling method would enable further constraining of global dissolved C concentrations in soil solutions.

Supplementary material and data availability
Supplementary material, including the database, is publicly available on PANGAEA (https ://www.panga ea.de/). We are grateful to Marta Camino-Serrano (Centre for Ecological Research and Forestry Applications, CREAF) for providing us with a list of several studies to include in the database, which we acknowledge her by mentioning it in the database references concerned. We further thank Peter Janssen (PBL) and Maarten Zeylmans Van Emmichoven (Physical Geography -UU) for assisting in the analysis design and data extraction from maps respectively.

Acknowledgements
Author contribution JL, AFB, AHWB and JJM designed the study. Database compilation was conducted by JL. Code writing and data extraction from grid was done by JL, AHWB, JMM and LV. All authors contributed to the analysis and discussion of the results. JL wrote the manuscript. AFB, LV, JJM contributed to the revision of the manuscript.

Compliance with ethical standards
Conflict of interest The authors declare no conflict interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.