Introduction

Peatlands are significant global ecosystems, encompassing approximately 3% of the Earth’s land surface. These ecosystems are crucial carbon reservoirs, storing a substantial amount of soil organic carbon, estimated to be between 450 and 650 Pg (Yu et al. 2010; Loisel et al. 2017; Jackson et al. 2017). Due to their substantial carbon stocks, peatlands play a vital role in the global carbon cycle and are among the largest terrestrial carbon pools. Peatlands have been used for agriculture (e.g., grazing) for over a millennium, increasing heavy drainage, currently raising international concerns. In Germany, for instance, most peatlands no longer work as carbon sinks due to agricultural activities (Tiemeyer et al. 2016, 2020).

Peatlands differ from other ecosystems because of waterlogged conditions, very peculiar vegetation composition (e.g., sedge, reed, and cattail plants), and surface peat layer (Gorham 1991; Lees et al. 2018). Although there are several types of peatlands, they all have the common feature to accumulate a significant volume of organic carbon over a long time period (Minasny et al. 2019). Therefore, quantifying the soil organic carbon stocks (SOCstocks) in peatlands is crucial. However, mapping the distribution of peatlands and their SOCstocks is a tough task (Koszinski et al. 2015; Minasny et al. 2019). Topography, restricted sampling depths, small fragments of peats worldwide, nutrient availability, land use history, and carbon biogeochemical dynamics are the foremost factors that hinder estimation of SOCstocks.

Proximal sensing in geophysical and soil studies through electromagnetic induction (EMI) instruments, which retrieve apparent electrical conductivity (ECa), have been a useful tool to estimate peat thickness and SOCstocks (Saey et al. 2012; Koszinski et al. 2015; Huang et al. 2016). Moreover, ECa information has proven to be highly correlated with peat properties, such as water content, dissolved ions content, and decomposition stages (Comas and Slater 2004; Walter et al. 2015), which are essential data to estimate SOCstocks. ECa has the potential to improve the SOCstocks predictions in soils, mainly peatlands, corroborated by some studies (Comas and Slater 2004; Kettridge et al. 2008) that presented the geophysical methods to quantify peat thickness and SOCstocks.

The main challenge is to provide the most efficient spatial extrapolation of ECa and use it as an environmental variable to improve the predictive power of SOCstocks models. Although there have been several studies on ECa mapping using geostatistical methods (Sun et al. 2012; Saey et al. 2012; Altdorff et al. 2016; Zhang et al. 2020), the use of machine learning methods, such as random forest, for this purpose is relatively scarce in the soil science literature. As evidence of this, we identified only one study in the literature (Taghizadeh-Mehrjardi et al. 2014) that evaluated the creation of digital maps of ECa using regression kriging, a hybrid geostatistical method that combines elements of regression analysis (e.g., machine learning) and kriging to improve spatial predictions of the variable of interest (Keskin and Grunwald 2018).

Digital soil mapping (DSM) uses environmental variables that explain the response variable behaviour in the landscape to map it. To date, few studies have investigated the use of DSM framework to map ECa data from laboratory and EMI. For example, Yang et al. (2019) compared multiple linear regression, geographically weighted regression, mixed geographically weighted regression and the regression kriging of residuals of three algorithms to predict soil electrical conductivity data from laboratory. Wu et al. (2018) also tested three machine learning methods, such as support vector machine, multiple linear regression, and random forest to predict ECa data from EMI. Although those authors utilised ECa data from EMI, they did not collect data following the conventional field sampling design required for geostatistical applications, nor did they incorporate the final predicted ECa maps as environmental variables for modelling other soil properties. Moreover, their analysis was limited to up to 30 spatially distributed samples within their study site.

DSM framework also requires a good source of environmental variables to better explain spatially the response of the variable behaviour in the landscape. Remote sensing (RS) data from satellites is the main provider of environmental variables for DSM (McBratney et al. 2003; Hartemink et al. 2020; Thompson et al. 2020). Satellite images from Landsat, Sentinel, RapidEye, and other missions provide the soil multispectral response and vegetative targets on the Earth surface allowing to calculate their representative indices or use their individual bands separately in different spatial scales (Grinand et al. 2017).

Spectral indices based on satellite imagery, such as brightness index, normalised difference vegetation index, red-edge normalised vegetation index, enhanced vegetation index, and redness index proved to be more effective than use the individual bands separately to predict SOC (Lamichhane et al. 2019), because vegetation is positively correlated to SOC. The authors also highlighted the importance of other RS data, such as digital elevation models (DEM) and their terrain derivatives, because these derivatives can retrieve potential depositional, erosional, moist, and dry areas, which are highly correlated to SOC. Therefore, RS imagery, DEM, and terrain derivatives are recommended to be applied as environmental variables to map SOC.

Spatial resolution of RS data is one of the main factors to increase modelling performance of SOC. Wiesmeier et al. (2019) conducted a comprehensive literature review on different factors that may affect SOC quantification and spatialisation at different scales. High resolution RS data can provide more detailed and effective information of SOC dynamics in the soil surface and subsurface. Consequently, the quality of environmental variables influences the accuracy of SOC prediction (Miller et al. 2016). For example, Forkuor et al. (2017), investigated the use of high resolution data to map SOC spatial distribution and stated that detailed RS data plays an important role in SOC modelling.

There are no reports in the literature of the use of ECa data from EMI with high resolution RS data and machine learning method (e.g., random forest) or universal kriging for DSM. This means that the geostatistical approach at local scale is still the usual procedure to produce high-resolution ECa maps from field survey data. Unlike other geostatiscal methods, the universal kriging method allows kriging in the presence of strong trends in the sample data using available secondary information (i.e., environmental variables) at all prediction locations (Trangmar et al. 1986). Nevertheless, the model parameter fitting is linear due to the strict assumptions of the universal kriging (UK) model. The machine learning methods, such as random forest (RF), work differently from the geostatiscal approach because they do not require strict model assumptions. In this sense, RF can handle non-linear and linear relationships between the response and environmental variables, which allows classifying it as a nonparametric method. RF is also a well-established machine learning algorithm in digital soil mapping (Khaledian and Miller 2020; Wadoux et al. 2020).

Based on these considerations, we hypothesised whether the use of the machine learning method and UK through DSM framework can be viable to produce ECa maps from field surveys and then use ECa maps as an environmental variable to improve the predictive power of SOCstocks models in peatlands. To test the hypotheses, we evaluated three possible scenarios combining different sets of environmental variables with the modelling technique to map ECa. Defined as: Scenario 1, only spectral indices calculated from RapidEye satellite collection and random forest (final map named as ECa1); Scenario 2, spectral indices calculated from RapidEye satellite collection combined with terrain derivatives from LiDAR sensor and random forest (final map named as ECa2); and Scenario 3 spectral indices calculated from RapidEye satellite collection combined with terrain derivatives from LiDAR sensor and UK (final map named as ECa3).

Afterwards, we assessed the efficacy of adding ECa maps as environmental variables to produce SOCstocks maps. Therefore, we incorporated the ECa maps produced (i.e., ECa2 and ECa3) with the environmental variables from Scenario 2 and random forest for DSM of SOCstocks as follows: (a) only the environmental variables from Scenario 2, (b) ECa2 with the environmental variables from Scenario 2, and (c) ECa3 with the environmental variables from Scenario 2.

Materials and methods

Study site and main landscape features

The study site is part of the Havelländisches Luch (NE Germany) and covers 26.54 km2 in the municipalities of Wiesenaue, Paulinenaue, and Fehrbellin (Fig. 1). It is also in the major geographical region of Northern Lowland, formed during the Pleistocene era resulting from sub-, peri-, glacio-fluvial and post-glacial geomorphological processes. Topographically, the study site is characterised by flat relief (Fig. 1a), which contributes to the formation of groundwater dependent wetlands or ecosystems. Peatland (water-rise mire) is the main type of wetland in that region (Mueller et al. 2007) and the average groundwater levels are near the surface up to 0.3 m. This area is highly heterogeneous due its parent material (e.g., fluvial sediments, outwash, sand bars, dunes, and alluvial fans) and land management, mainly the drainage history. The mean annual precipitation and air temperature are 550 mm (interannual variation 340–950 mm) and 9.5 °C, respectively. The most common soil groups are Arenosols, Gleysols, and Histosols (IUSS Working Group WRB 2015) used for arable land, meadows, and pastures.

Fig. 1
figure 1

Study area (©Google Maps 2019) displaying the Digital Elevation Model from LiDAR and collected soil cores (a), ECa calibration data representing 80% of the total samples (b), ECa validation data representing 20% of the total samples named validation 01 (c), and additional validation data outside the total data named validation 02 (d) from 2010. Validation 01 involves cross-validation (e), while Validation 02 employs hold-out validation (f), as exemplified by zooming in on Farm 02 and 05

Soil and apparent electrical conductivity data

Data on total SOCstocks were acquired from 49 soil cores (Fig. 1a) and stratified random sampling was the selection procedure of those points, as described in Koszinski et al. (2015). The soil samples were collected using a hydraulic probe (2 m long and 10 cm in diameter), air-dried, ground, and sieved (< 2 mm mesh). Only the fraction < 2 mm was used for further analyses. An aliquot of the air-dried < 2 mm fraction was oven-dried at 105 °C (24 h) to obtain the water contents of the air-dried samples. The total C content was determined using the elemental analysis (CNS analyser TruSpec, LECO Ltd., Mönchengladbach, Germany) as CO2 through infrared detection after dry combustion at 1250 °C (DIN ISO10694 1996). The gas-chromatographic analysis of CO2 evolution (Carmograph by Woesthoff, Scheibler-method) was used to determine carbonate C after an application of phosphoric acid. The subtraction between total and carbonate C results in the soil organic carbon content (SOCc). All data are presented on an oven-dry basis.

Steel rings of 100 cm3 were used to collect soil samples from 103 out of 282 horizons (49 soil profiles). Afterwards, bulk density (Db) was determined using the thermogravimetric desiccation at 105 °C (DIN EN ISO 11272 2014) in the lab. The remained 179 missing values of Db were calculated using a pedotransfer function (e.g., relationship between measured Db and SOCc), as described in Koszinski et al. (2015), and not collected because they were inaccessible (groundwater levels or disturbed by the sampling). Thus, SOCstocks for each soil core was calculated to a 1 m depth following the methodology and formula described in Schlichting et al. (1995):

$${SOC}_{d}=\sum _{i=1}^{N}{SOC}_{c}\times t\times {D}_{b}\left[\left(100-Gr\right)/100\right]$$

Where: SOCd is the soil organic carbon density (SOCstocks, kg m− 2), SOCc is the soil organic carbon content (wt%), t is thickness (cm), Dd is bulk density (Mg m− 3), and Gr is the mass percentage of the fraction > 2 mm (gravel, wood fragments etc.) of a specific horizon.

Apparent electrical conductivity (ECa) was measured using an electromagnetic induction sensor with the minimal contact EM38DD device (Geonics Ltd., ON, Canada) in April (farms 01, 02, 03, 06, and 07), and July – August (farms 04, and 05) of 2010. The device retrieves ECa in two dipole sets, horizontal (ECa_h, 0.75 m), and vertical (ECa_v, 1.5 m) (Abdu et al. 2007). In our study, we selected ECa_v, described as ECa throughout this article, because it has the same depth information as calculated SOCstocks. The effect of soil temperature was measured at a 0.1 m depth to calibrate the device and standardised at 25 °C (Brevik et al. 2004). ECa data were collected in eight different fields with tracking lanes of 10–15 m apart and within lanes of 1–3 m (Koszinski et al. 2015). The total number of samples was 59,759. ECa data from 2010 was used to calibrate (Fig. 1b) and validate (Fig. 1c and d) the models.

Environmental variables

Eight high-resolution environmental variables were generated. Four variables derived from an airborne high-resolution Light Detection and Ranging (LiDAR) and four from RapidEye satellite images to model ECa (Table 1). The Digital Elevation Model (DEM) was retrieved from LiDAR in December 2008. The basic outputs to generate the DEM were four points per square meter at a 50,000 Hz pulse frequency, and vertical accuracy of 3 cm in the z-axis. DEM was generated using the scattered points by aggregating on a 1 × 1 m grid and resampled to 5 m by the nearest neighbour method. Three terrain derivatives were calculated through QGIS (QGIS Development Team 2020) using the DEM. The first was the Topographic Position Index (TPI) in which positive values characterise locations higher than the mean of their surroundings, while negative values stand for locations lower than their surroundings. TPI near or equal zero are either constant slope areas or flat areas. TPI is an appropriate relief environmental variable, as it is an intrinsically scale-dependent phenomenon. The other derivatives were the slope and topographic wetness index (TWI), which represent the steepness and possible accumulated water at each pixel of the DEM, respectively. The derivatives were selected because of their high correlation with ECa and SOCstocks (Koszinski et al. 2015).

Table 1 Environmental variables retrieved from Light Detection and Ranging (LiDAR) and RapidEye satellite

The RapidEye orbit is composed of five satellites with identical sensors launched in 2008. Each satellite has five bands at a spatial resolution of 6.5 m resampled to 5 m during pre-processing: blue (440–510 nm), green (520–590 nm), red (630–685 nm), red-edge (690–730 nm), and near-infrared (NIR, 760–850 nm). The RapidEye data used in this study were retrieved from the RapidEye Sciences Archives (RESA) for 3rd of June, 27th of June, 3rd of July, 16th of July, 21st of August, and 4th of October 2010, consisting of a 3 A level ortho products (cloud cover < 5%). This means that the satellite data were radiometrically and geometrically corrected. Finally, we used the FLAASH algorithm (ENVI V. 4.4.) for atmospheric correction. Scattered clouds and respective shadows were removed manually. Moreover, we removed artefacts, such as roads, forest, and urban areas, using vector layers provided by the “Amtliches Topographisch-Kartographisches Informationssystem” (ATKIS). The normalised difference vegetation index (NDVI), red-edge normalised vegetation index (RENDVI), enhanced vegetation index (EVI), and brightness index were calculated from the RapidEye data for each of the specified dates (Table 2).

Table 2 Vegetation indices selected for this study site

Modelling

In this study, we evaluated random forest (RF) and universal kriging (UK) algorithms that could satisfactorily deal with ECa and SOCstocks dataset. RF can handle non-linear and linear relationships between response and predictor variables, which classifies RF as a nonparametric method. RF is also a well-established algorithm in machine learning for DSM (Khaledian and Miller 2020; Wadoux et al. 2020). RF trains bootstrap samples of the data using a large number of individual tree models and RF main tuning parameters are the number of variables available for selection in each split (mtry) as well as the number of trees (ntree) (Breiman 2001; Houborg and McCabe 2018). Unlike other geostatiscal methods, UK allows kriging in the presence of strong trends in the sample data using available secondary information (i.e., environmental variables) thus UK can effectively derive prediction uncertainties through kriging variance handling both the variogram modelling of the residuals and regression model together. In our study, the predictions using UK were made on a regular 5 m x 5 m grid. The UK prediction variance simplistically become (Trangmar et al. 1986; Christensen 2011):

$$Z\left(x\right)={\beta }_{0}+\sum\limits_{i=1}^{p}{\beta }_{i}{f}_{i}\left(x\right)+e\left(x\right)$$
(5)

Where: \(Z\left(x\right)\) represents the predicted value at location \(x\), \({\beta }_{0}\) and \({\beta }_{i}\) stand for unknown regression coefficients, \({f}_{i}\left(x\right)\) denotes the environmental variables, and \(e\left(x\right)\) indicates a normally distributed residual with zero-mean and constant variance c\(\left(0\right)\). The residual \(e\) may exhibit spatial autocorrelation, which is quantified through a variogram.

Therefore, three possible scenarios were generated by employing distinct sets of environmental variables along with a modelling technique to map ECa. Scenario 1 involves solely spectral indices calculated from the RapidEye satellite collection and random forest; Scenario 2 incorporates spectral indices from the RapidEye satellite collection in combination with terrain derivatives from LiDAR sensor data and random forest; and Scenario 3 comprises spectral indices from the RapidEye satellite collection along with terrain derivatives from LiDAR sensor data and universal kriging. We applied RF to model ECa and SOCstocks data, while UK was applied to model ECa data only. For RF models, the calibration data for model fitting were set by using a ten-fold cross-validation method, executed ten times, to avoid the effects of spatial autocorrelation between data points (Wadoux et al. 2021) through the “caret” R package (Kuhn 2008). The RF tuning hyperparameters were all set to their default value by taking into account the lowest root mean square error (RMSE) and highest model efficiency coefficient (MEC) in the “caret” R package. The UK model was performed using the “automap” R package (Gianola et al. 2011). The model and semivariogram are described in Fig. S1.

Evaluation metrics

An exploratory analysis was performed by calculating the Pearson’s correlation coefficient (r) between the response and environmental variables. The interpretation of r was based on the following categorisation: low or weak correlation for − 0.35 ≤ r ≤ 0.35, modest or moderate correlation for 0.36 ≤ r ≤ 0.67 and − 0.67 ≤ r ≤ -0.36, and strong or high correlation for 0.68 ≤ r ≤ 1 and − 1 ≤ r ≤ -0.68 (Taylor 1990).

The modelling calibration and validation for ECa were performed using the 58,481 observations from 2010. The conditioned Latin Hypercube Sampling (cLHS) was used to split the data in calibration (80%, 46,784 observations)(Fig. 1b) and validation (20%, 11,697 observations)(Fig. 1c and e) based on remote sensing data through the “cLHS” R package (Roudier et al. 2012). According to Minasny and McBratney (2006), the cLHS approach is more effective to reproduce the distribution of variables than random sampling and equal spatial strata. From here onwards, the 20% validation data is referred to as “validation 01” (Fig. 1e) because we collected an additional validation data of 1,278 observations, which were collected at a farm (i.e., farm 05) outside the area where the 58,481 observations were retrieved, but inside the study site (Fig. 1d and f). From here on, this second validation dataset is referred to as “validation 02” (Fig. 1f). Validation 02 was excluded from the calibration data to prevent overoptimistic conclusions (i.e., Farm 05). All dataset were collected in the same year to remove any doubts on this proposed modelling framework.

For the calibration of SOCstocks models using RF, the data from farms 01 to 06 were used by applying the ten-fold cross-validation method, executed ten times, selecting the model that provided the lowest root mean square error (RMSE) and the highest model efficiency coefficient (MEC) (Janssen and Heuberger 1995). In order to validate and avoid overoptimistic conclusions, SOCstocks data from farm 07 were not included in the calibration and were used as external validation due to the limited number of samples.

The RMSE, MEC, concordance Lin’s concordance correlation coefficient (CCC) and Bias were the selected metrics of model assessment using the validation data, as follows:

$$RMSE=\sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}}$$
(6)
$$MEC=1-\frac{{\sum }_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}}{{\sum }_{i=1}^{n}{({y}_{i}-{\mu }_{obs})}^{2}}$$
(7)
$$Bias=\frac{1}{n}\sum\limits_{i=1}^{n}({\widehat{y}}_{i}-{y}_{i})$$
(8)
$$CCC=\frac{2\rho {\sigma }_{pred}{\sigma }_{obs}}{{\sigma }_{pred}^{2}+{\sigma }_{obs}^{2}+{\left({\mu }_{pred}-{\mu }_{obs}\right)}^{2}}$$
(9)

Where: \(n\), \({y}_{i}\), \({\widehat{y}}_{i}\), \({\sigma }_{pred}^{2}\), \({\sigma }_{obs}^{2}\), \({\mu }_{pred}\), \({\mu }_{obs}\), and \(\rho\) are, respectively, the sample sizes, observed values, predicted values of the response variable, the prediction and observation variances, the means of the predicted and observed values, and the correlation coefficient between the predicted and observed values.

Results

ECa and SOCstocks datasets

The descriptive statistics were calculated for ECa and SOCstocks (Table 3). ECa presented skewness and kurtosis values close to zero, as well as a high coefficient of variation. The skewness and kurtosis values for SOCstocks were also close to 0. The level of dispersion around the mean was higher. The calibration and validation data from 58,481 observations were split using the conditioned Latin Hypercube Sampling for unbiased selection of points to calibrate the models as well as points located in the same areas. This means that there was a set of validation points close to the selected calibration points (Fig. 1e). Nevertheless, we did not include one farm in the calibration data (Fig. 1f) to avoid overoptimistic conclusions in both datasets (ECa, farm 05; SOCstocks, farm 07), representing a field survey after modelling ECa and SOCstocks.

Table 3 Summary statistics

Modelling ECa

The Pearson’s correlation analysis was computed to establish the degree of correlations between ECa and environmental variables to further describe their interrelationships (Fig. 2). ECa presented moderate correlations with the DEM, NDVI, and RENDVI from 16th of July, NDVI, RENDVI, and brightness indices from 21st of August. On the other hand, the remaining environmental variables showed weak correlation with the ECa ranging between − 0.32 and 0.35.

Fig. 2
figure 2

Pearson’s correlation coefficient between the apparent electrical conductivity and the environmental variables in the study area. The variables are represented by abbreviations: ECA_25_V, apparent electrical conductivity from electromagnetic induction instrument; DEM, digital elevation model; TPI, topographic position index; TWI, topographic wetness index; NDVI, normalised difference vegetation index; RENDVI, red-edge normalised difference vegetation index; EVI, enhanced vegetation index. The numbers correspond to the date/month format. For example, NDVI.1607 represents the normalised difference vegetation index for the 16th of July

After generating the models applying the random forest algorithm, we assessed the model performance using validation data 01 and 02 (Fig. 3). Regarding validation 01, Scenario 1 (final map named as ECa1) fitted the ECa model showing RMSE, MEC, CCC, and Bias of 7.15 mS m− 1, 0.95, 0.98, and 0.05, respectively. In Scenario 2 (final map named as ECa2), we combined environmental variables from Scenario 1 with terrain derivatives to predict ECa. This procedure improved the predictive power of the models (RMSE = 6.41 mS m− 1, MEC = 0.96, CCC = 0.98, and Bias = 0.09), effectively observed through the RMSE. Therefore, Scenario 2 decreased the RMSE value by 10% compared to Scenario 1. Scenario 3 (final map named as ECa3) showed better efficacy than the other two scenarios (RMSE = 5.42 mS m− 1, MEC = 0.98, CCC = 0.99, and Bias = -0.01) using spectral indices calculated from RapidEye satellite collection combined with terrain derivatives from LiDAR sensor and universal kriging (Fig. S1).

Fig. 3
figure 3

Graphs of observed and predicted values for the soil apparent electrical conductivity based on both external validation dataset in the study area. Scenario 1, only spectral indices calculated from RapidEye satellite collection and random forest; Scenario 2, spectral indices calculated from RapidEye satellite collection combined with terrain derivatives from LiDAR sensor and random forest; and Scenario 3 spectral indices calculated from RapidEye satellite collection combined with terrain derivatives from LiDAR sensor and universal kriging

Regarding validation 02 (Figs. 1d and 3) and comparing again all scenarios at once, Scenario 3 outperformed Scenarios 1 and 2. We avoided interference of natural factors, such as groundwater level, cover vegetation, and temperature changes, using calibration and validation data from the same period. The final predicted maps were generated for Scenarios 1 (Fig. 4a), 2 (Fig. 4b), and 3 (Fig. 4c).

Fig. 4
figure 4

Predicted maps for soil apparent electrical conductivity (ECa) from electromagnetic induction (EMI) instrument applying scenarios 1 (a), 2 (b), and 3 (c) in the study area

Evaluating ECa maps and their relationship with SOCstocks

The level of correlation through an exploratory analysis between SOCstocks and ECa are shown in Fig. 5. Predicted ECa2 and ECa3 were the only environmental variables that showed a high correlation value (r = 0.78). DEM, the brightness index from 3rd of June, and the brightness index from 4th of October presented respectively r values of -0.67, -0.54, and − 0.44, which mean a modest or moderate correlation. The other environmental variables displayed r values ranging from − 0.36 to 0.33, categorised as having a low or weak correlation with SOCstocks.

Fig. 5
figure 5

Pearson correlation coefficient between the soil organic carbon stocks (SOCstocks) and the environmental variables including the predicted ECa maps in the study area. The variables are represented by abbreviations: SOCstocks, soil organic carbon stocks; DEM, digital elevation model; TPI, topographic position index; TWI, topographic wetness index; NDVI, normalised difference vegetation index; RENDVI, red-edge normalised difference vegetation index; EVI, enhanced vegetation index; ECa2, predicted ECa map from scenario 2; ECa3, predicted ECa map from scenario 3. The numbers correspond to the date/month format. For example, NDVI.1607 represents the normalised difference vegetation index for the 16th of July

Case study: SOCstocks modelling in peatlands using ECa

For case study, we modelled SOCstocks comparing the level of ECa contribution as a new environmental variable, which was achieved by modelling the response variable (e.g., SOCstocks) using only environmental variables from Scenario 2 and these variables including the best ECa predicted maps from Scenario 2 (ECa2) and 3 (ECa3) (Table 4 and Fig. S2).

Table 4 Model assessment of the predicted soil organic carbon stocks (kg m− 2) through Random Forest using hold-out validation data from farm 07

Here, this exercise was aimed at showing the potential use of ECa maps generated by the machine learning procedure to better extrapolate SOCstocks data. To ensure a robust validation of RF model with a limited dataset, we left one area out to avoid overoptimistic results. In this sense, we evaluated the potential of adding ECa information to improve the modelling predictive power of SOCstocks (Table 4). Overall, the best fitted models were obtained by adding ECa information (RMSE: 4.54 and 4.51 kg m− 2) against the model using only environmental variables from Scenario 2 (RMSE: 5.21 kg m− 2). MEC also increased by ~ 83 and 70% for the models combining ECa with environmental variables from Scenario 2. This analysis showed that ECa can improve the modelling predictive power of SOCstocks. Notably, all modelling scenarios exhibited relatively similar variable importance (Fig. 6); however, the addition of predicted ECa significantly changed the magnitude of importance.

Fig. 6
figure 6

Relative contribution values of environmental variables in modelling SOCstocks using Random Forest. The variables are represented by abbreviations: DEM, digital elevation model; TPI, topographic position index; TWI, topographic wetness index; NDVI, normalised difference vegetation index; RENDVI, red-edge normalised difference vegetation index; EVI, enhanced vegetation index; Pred ECa, predicted ECa map from scenarios 2 (ECa2) and 3 (ECa3). The numbers correspond to the date/month format. For example, NDVI.1607 represents the normalised difference vegetation index for the 16th of July

The final maps predicted for SOCstocks are shown in Fig. 7. The need to incorporate geophysics (ECa), as a tool to analyse soils, becomes evident when comparing the approaches in their spatially pattern. These spatial domains show the largest peat thickness thus the most vulnerable landscape elements in terms of drainage and related CO2 source strength. The model strongly overestimates SOCstocks without ECa information in the central part of our study site where the spatial domains are located.

Fig. 7
figure 7

Predicted maps of SOCstocks using only environmental variables from scenario 2 (a), with predicted ECa applying random forest, and (c) predicted ECa applying Universal Kriging

Discussion

The skewness, kurtosis, and SD of the ECa dataset can be considered of normal distribution. For instance, SD in our study was 15% lower than the results found in Koszinski et al. (2015) for the same study site. SOCstocks presented a normal distribution based on the descriptive statistics. The SD value for SOCstocks was also lower (~ 28%) than that described in Koszinski et al. (2015). The normal distribution is a desirable characteristic in the response variable for modelling, which is unlikely to occur in a natural environment (Malone et al. 2013). However, we found a normal distribution for ECa and SOCstocks.

Two validation data were performed for ECa in this study (Fig. 1c and d). The first was selected by splitting the 58,481 observations into 80/20% for calibration and validation, respectively. Validation 01 had observations in the same sites of the calibration set (Fig. 1e). The procedure applied in that case was the conditioned Latin Hypercube Sampling (cLHS), which split observations based on environmental variables for a more effective model training of the response variable (Brungard and Boettinger 2010). Furthermore, cLHS was recommended by Minasny et al. (2019) reviewing the current state of digital mapping of peatlands, requiring further research. Validation 02 (i.e., farm 05) was composed of 1,278 observations (Fig. 1f), which could represent a new field survey after generating ECa predicted maps. Moreover, we avoided unbiased evaluation of the models using the second validation data. This procedure was crucial to represent the real in-situ situation to consistently determine the well-fitted models among the modelling scenarios.

Six environmental variables had the highest correlation coefficient with ECa, although that correlation was considered moderate (Mason et al. 1983). Notably, among these variables, DEM showed the highest association with ECa (Fig. 2). In a similar context, Koszinski et al. (2015) reported a Pearson’s correlation coefficient of -0.69 between ECa_v and DEM. The expected negative correlations between DEM and ECa underscore the need to address these influences when interpreting ECa mapping results (Figs. 3 and 4). The topographical features captured in the DEM can impact water flow, drainage patterns, and soil moisture levels, which, in turn, affect the electrical conductivity of the subsurface. Therefore, low-lying areas (e.g., depressions and valleys) may accumulate more moisture, leading to higher electrical conductivity values as water enhances the conductivity of the soil. Conversely, higher elevations (e.g., hills and ridges) might have lower moisture content, resulting in lower electrical conductivity values.

The assessment of model performance was conducted using both validation data after generating the models. Overall, the best fitted models for ECa were Scenarios 2 and 3, respectively, performing RF (RMSE: 6.41 and 9.77 mS m− 1; MEC: 0.96 and 0.90) and universal kriging (RMSE: 5.42 and 7.09 mS m− 1; MEC: 0.98 and 0.94). Taghizadeh-Mehrjardi et al. (2014) found RMSE of 37.74 mS m− 1 and MEC of 0.49 mapping ECa using the local regression kriging approach and terrain derivatives from Landsat ETM + as environmental variables. Wu et al. (2018) predicted soil salinity (ECa) using Landsat ETM + and ALOS data as environmental variables through RF and Support Vector Machine algorithms. The authors found MEC values between 0.72 and 0.89. Therefore, our results presented better accuracy than most studies conducted to date (Saey et al. 2012; Yang et al. 2019; Zhang et al. 2020). We attribute the high accuracy of our models to high-resolution digital elevation model and its derivatives, as well as the high-resolution multispectral data from RapidEye satellite sensor in the same year of ECa field collection. The final predicted maps of ECa showed high values from the NW-SE direction, which corroborated the geological formation related to glacial processes during the Pleistocene phase (Koszinski et al. 2015). The microrelief features in the region are also clear. The study site is characterised by extensive floodpains and bas-relief; however, there are small spots with high ECa concentration from summit to toeslope in the NE and SW directions. This shows the importance of predicted ECa maps to characterise peat thickness and help to extrapolate information to unknown areas, which can be achieved by applying machine learning and geostatistical methods, such as RF and UK.

The importance and novelty of our procedure to generate ECa maps lie in the fact that ECa is highly correlated with SOCstocks. Altdorff et al. (2016) found the Pearson’s correlation (r) values ranging from 0.4 to 0.3 between ECa and SOCstocks at three different soil peat depths (25, 50, and 100 cm). Koszinski et al. (2015) also achieved r value of 0.79 for information between ECa and SOCstocks. Thus, these studies corroborate our findings and highlight the high potential of ECa to quantify and extrapolate SOCstocks information in peatlands. The high correlation among the mapped ECa2, ECa3 and SOCstocks prove the efficacy of ECa modelling through RF and UK approaches (Fig. 5). Our study demonstrates that integrating spatially explicit ECa maps as engineered variables significantly enhances the accuracy of predicting SOCstocks. This departure from solely relying on base variables highlights the value of exploring new environmental variables to improve model calibration and prediction. Additionally, we observed significant improvements in SOCstocks predictions with the inclusion of ECa data (Fig. S2 and Table 4), further emphasising the pivotal role of geophysical information (Fig. 6). The engineering of new variables, represented by spatially explicit ECa maps, proves to be highly beneficial in predicting the SOCstocks.

Questions about the advantages of using machine learning or universal kriging models may arise. Therefore, performing UK and RF on ECa data allowed mapping the total area of 26.54 km2 rather than only 1.12 km2. ECa data could be extrapolated 23.7 times using RF and UK modelling. This study directs further research into extrapolating and replicating this methodology through high-resolution digital elevation models and their derivatives, as well as the high-resolution multispectral data from RapidEye satellite sensor in the same year of ECa field collection. Other sources of high-resolution multispectral data should be evaluated taking into account different continental climates and ecosystems. This shows the potential of using machine learning approaches to predict ECa from EMI sensors whether spatial, temporal, spectral, and sampling components in the study site.

Another question raised regards how large a dataset should be to fit a machine learning, which contained 49 observations for SOCstocks in our study. However, our validation strategy leaving one area out (farm 07: 9 observations) could provide valuable information on the efficacy of RF models and avoid overoptimistic conclusions. According to Padarian et al. (2020), there is no clear rule for the dataset size that constitutes a problem to use machine learning methods. It depends on how complex the problem is and its relationship with the environmental variables that can help explain or understand the problem. Therefore, ECa and high-resolution remote sensing data improved the prediction power of SOCstocks modelling in peatlands.

The drainage network and groundwater patterns can be clearly detected by the modelling scenarios (Fig. 4). It is vital to detect these characteristics in peatlands because agricultural lands involve draining peat, while increasing peat consolidation and decomposition (Hoogland et al. 2012), leading to land subsidence of peat soils (Minasny et al. 2019). Thus, in case there is a slight chance of extrapolating ECa information to better detect the aforementioned characteristics, it has to be taken into account. Therefore, we predicted SOCstocks with and without ECa information as one of the environmental variables (Fig. 7). Our results show that ECa improved SOCstocks predictions compared with predictions without ECa, evidencing the need to incorporate geophysics (ECa) into methods as a tool to analyse soils. The resulting map reveals important insights into the distribution of SOCstocks across the study area. We observe distinct spatial patterns, including areas with high and low SOCstocks, which can be attributed to different environmental factors, land management practices, and hydrological conditions. Proximal sensing data, such as that predicted ECa, is one of the most important environmental variables to enhance the mapping of peatlands and their properties (i.e., SOCstocks).

Conclusions

The methodology and approach used in this study showed the feasibility of generating ECa maps from field survey EMI data and utilising them as a valuable environmental variable for predicting SOCstocks in peatlands. Through extensive model assessments, our models exhibited higher accuracy compared to previous studies, attributed to the integration of spatially explicit ECa maps as engineered variables that significantly enhanced the accuracy of predicting SOCstocks, emphasising the importance of exploring new environmental variables in digital soil mapping.

Our study also highlights the scalability and benefits of machine learning approaches for predicting ECa from EMI sensors across larger areas. While the dataset size for machine learning methods remains a subject of consideration, our validation strategy effectively assessed the model performance and avoided overoptimistic conclusions. Overall, our findings indicate that ECa, along with high-resolution remote sensing data, can enhance the prediction power of SOCstocks modelling in peatlands. We advocate for the incorporation of proximal sensing data, such as predicted ECa from field survey EMI data, as a critical environmental variable in optimising peatland mapping and property characterisation, surpassing the reliance solely on LiDAR data.

We hope our research opens new avenues for further investigations, harnessing the potential of machine learning approaches, proximal and remote sensing to engineer new environmental variables. This advancement enhances our understanding and management of peatland soils and other terrestrial ecosystems.