Introduction

Dealing with global and regional challenges in land degradation, food security, water scarcity, and climate change, an accurate and updated geospatial soil information is imperatively needed [1, 2]. These problems are directly related to soil functions particularly to agricultural productivity, loss of biodiversity, and provision of water [3]. Producing accurate and reliable soil maps is indispensable in watershed management [4, 5], rangeland management, and landscape ecology [6].

The traditional soil survey process is tediously difficult to update rapidly and accurately. This process has associated significant limitations. First, significant changes in environmental conditions are not readily observed, especially when processing several variables simultaneously; secondly, the entire process must be repeated for each update that makes soil survey updates very inefficient [7]. Conventional soil survey adapts the manual process of producing a polygon-based soil map, whereby, without the computer-based approach, the map cannot be updated rapidly and accurately as the entire production procedure must be repeated [7]. Such method is time-consuming, requires numerous soil samples, and expensive [8]. Geographic information systems (GIS) can overcome this problem with the application of digital soil mapping (DSM). The DSM estimates soil properties by establishing interrelationships between soil properties and the environmental variables derived mainly from the digital elevation model (DEM) and remotely sensed images [8, 9]. Thus, the direction of DSM is toward the generation of dynamic and replicable geospatial soil information [10].

GIS algorithms have been adapted for an efficient spatial interpolation technique in land resources inventories [11, 12] in addressing the limitation of the traditional soil survey. GIS is a tool for data input, handling, analyzing and output process. It plays a significant role in spatial decision-making that involves information collection for DSM. GIS can perform several tasks using both spatial and attribute data and can integrate a variety of geographic technologies like Global Positioning System (GPS) and Remote Sensing (RS). GIS integrates spatial and geostatistical analysis, and the efficient management, storage, and retrieval of geographic data [8, 9, 13]. Thus, GIS as a tool plays a significant role in the implementation of a computer-based spatial decision making support system.

The slow progress in agricultural production and the steadily increasing population in the Philippines require the applications of this computer-based decision support system. The Philippine Department of Agriculture is still using over 40-year-old soil information in its programs for climate change mitigation and land use plan [14]. Bureau of Soils and Water Management is still in traditional soil survey method and is yet to implement DSM [15]. With these enormous challenges, there is an urgent necessity that soil maps in agricultural areas in the country be updated applying the modern technology of DSM.

Like most of the parametric test, multiple linear regression (MLR) requires normally distributed and homoscedastic variables [16, 17]. It is helpful for data analysis and inferences that both dependent and independent variables that are highly skewed [17, 18], and where standard deviations among variables significantly differ [16] should be transformed to normal or nearly normal distribution [19]. Thus, in this study, all variables were transformed to meet the conditions for normal or nearly normal distribution, minimum error and unbiased estimate [20].

Hence, this study aimed in establishing relationships between topsoil texture fractions (clay, sand, and silt) and the environmental variables by applying and comparing MLR, ordinary kriging (OK), simple kriging (SK) and universal kriging (UK) using transformed data integrating the use of free and open source software (FOSS) of R, System for Automated Geoscientific analyses (SAGA) GIS, and QGIS.

This study presents a successful approach in DSM using FOSS that are of best cost advantage to be adapted by any GIS users from a developing country like the Philippines. The achieved methodology can lead to valuable outcome in achieving a more comprehensive land use plan since the generated results are useful for watershed management particularly for ecological, hydrological, and crop suitability modeling.

Materials and methods

Materials

Synthetic Aperture Radar (SAR) DEM was acquired from MacDonald, Dettwiler and Associates, British Columbia, Canada and post-processed by the UP Training Center for Applied Geodesy and Photogrammetry, through the DOST-GIA funded Disaster Risk and Exposure Assessment for Mitigation Program which is downloadable through https://lipad.dream.upd.edu.ph/. This SAR DEM has 10 m resolution and a projection of WGS84 UTM Zone 51. A cloud-free Landsat 8 image in the study area was selected and downloaded from the United States Geological Survey data archive (https://earthexplorer.usgs.gov) for RS data. The study makes use of an adequate hardware setup with 32 GB multi-core processor, 64-bit operating system, and solid-state drive to avoid hanging or crashing [21] necessary for GIS and RS processing.

The study area

Barili, the study area, is a second income class municipality, with an annual income of ₱45 M (0.87 M USD) or more but less than ₱55 M (1.06 M USD), in the southern part of Cebu Province, Philippines. The municipality has a population of 73,862 based on 2015 census. Barili is located 60 km southwest of Cebu City (Fig. 1). It has an area of about 124 km2 (about 9.5 × 13 km) with mostly agricultural lands. The elevation ranges from sea level to 540 m. It is bordered to the northeast with Carcar City and Sibonga; to the southwest with Dumanjug and the west lies the Tañon Strait; and to the northwest with Aloguinsan. Mountain ridgeline separates the watersheds of Barili in the west and Carcar City and Sibonga in the east.

Fig. 1
figure 1

The location of the Municipality of Barili (natural color composite) as the study area in the Province of Cebu. The distribution of samples (magenta), and the mountain ridgeline (red)

Soil sampling, preparation, and analysis

A total of 177 soil samples were collected within the Municipality of Barili, Cebu, Philippines. A pit measuring approximately 1 × 1 × 1 m was dug manually in each physiographic position to examine and sample the soil profile. About 2 kg of composite soil sample were obtained from every horizon of each soil profile. Soil sampling was done randomly as shown in Fig. 1. Soil samples were air dried, freed of rocks and plant materials, ground using a wooden mallet and allowed to pass through a 2-mm sieve. Soil texture analyses were done at the Regional Soils Laboratory, Department of Agriculture, Regional Field Office VII, Cebu City, Cebu, Philippines.

Environmental predictors

A total of 37 parameters were generated and analyzed to understand their relationship with soil texture properties. Environmental variables or predictors were the preprocessed sink-filled or depressionless DEM and its derivatives, geographic coordinates, and RS data as shown in Table 1. Majority of the predictors were generated from the preprocessed sink-filled DEM using SAGA GIS modules for basic terrain analysis, terrain analysis–channels, terrain analysis–hydrology, terrain analysis–morphometry, and terrain analysis–lighting. The predictors include (i) 7 parameters by basic terrain analysis module such as analytical hillshading, aspect, channel network base level, slope length and steepness (LS) factor, profile curvature, slope, and valley depth; (ii) 4 parameters by terrain analysis–channels such as overland flow distance, horizontal overland flow distance, vertical overland flow distance, and vertical distance to channel network; (iii) 3 parameters by terrain analysis–hydrology such as catchment area, flow accumulation, and SAGA wetness index; (iv) 7 parameters by terrain analysis–morphometry such as curvature classification, mass balance index, mid slope position, multiresolution index of valley bottom flatness (MRVBF), normalized height, slope height, and terrain ruggedness index; (v) 1 parameter by terrain analysis–lighting such as sky view factor. Furthermore, slope steepness (S) factor for universal soil loss equation (USLE) was generated applying r.watershed module of geographic resources analysis support system in QGIS. Mountain ridge proximity and stream proximity were generated using proximity grid module of SAGA GIS. Location of samples by their latitude and longitude, determined by GPS, were also included as predictors.

Table 1 Environmental variables, description, descriptive statistics, and references

Remotely sensed image of operational land imager (OLI) and thermal infrared sensor (TIRS) instruments of Landsat 8 on the study area dated July 27, 2017 were also considered as additional predictors and was pre-processed using the semi-automatic classification plugin (SCP) using QGIS. The remotely sensed predictors include 7 bands (bands 1 to 7) of OLI and 2 bands (band 10 and 11) of TIRS.

Preprocessing of remotely sensed data

The SCP for QGIS was applied for Landsat 8 conversion to top of atmospheric reflectance and brightness temperature, and pansharpening [27]. Multispectral image analysis requires conversion of its “quantized and calibrated scaled digital numbers” [23] to top of atmosphere reflectance in order to achieve clear Landsat scenes [28] which is packaged in SCP. Pre-processing of Landsat 8 image was discussed in details by Congedo [27] and in the study by Mondejar and Tongco [29].

Exploratory data analysis

The exploratory data analysis consisted of three steps and was performed in R with contributing packages ‘rcompanion’ for plotting histogram and performing Tukey’s ladder of power transformations; ‘moments’ for determining skewness and kurtosis; ‘nortest’ for determining different normality tests. Step 1: statistical data distribution was checked by plotting its histogram. Step 2: determination of skewness and kurtosis before data transformation. And Step 3: perform Tukey’s ladder of power transformation.

Transformation of data

Normal distribution of data is an assumption in geostatistical analyses similar to several statistical techniques [16, 17, 30, 31]. In instances of non-normality, especially for strongly skewed data, data transformation to normality or at least symmetric distribution is needed [30, 31]. Normality test of the data can be implemented by simple evaluation of its skewness (close to 0) and kurtosis (close to the range of 1 to 3), normal q-q plot, and inferential test for normality such as Shapiro-Wilk or Kolmogorov-Smirnov tests [32].

Soil texture fractions and all of the predictors were transformed applying Tukey’s ladder of powers transformation in R to ensure normality or near normality of dependent and independent variables. In this study, skewness and kurtosis evaluation were applied. The Tukey’s ladder of powers transformation is defined as:

$$ y=\left\{\begin{array}{c}{x}^{\lambda}\kern0.1em \mathrm{if}\kern0.1em \lambda >0\\ {} logx\kern0.1em \mathrm{if}\kern0.1em \lambda =0\\ {}-\left({x}^{\lambda}\right)\mathrm{if}\lambda <0\end{array}\right\} $$
(1)

where λ is the power coefficient parameter in transforming values of a parameter to follow a normal distribution as close as possible [17, 33]. In the rcompanion package, Shapiro-Wilk tests were performed iteratively and determine lambda value “that maximizes the W statistic” [17]. Furthermore, loop function in R was adapted to determine a particular transformation that had a distribution with skewness as close to zero and kurtosis as close to the range between 1 to 3.

Predictive models

The procedures tested to predict soil properties (sand, silt, and clay) were MLR, OK, SK and UK. The geostatistical procedures were implemented in SAGA GIS software [34]. MLR has been widely used to predict the response of a dependent variable from a set of independent variables, as a function of the correlations between them. The MLR analyses were executed in SAGA GIS applying forward selection, backward elimination, and stepwise selection methods, fitting the model by identifying variables which have most significance according to 95% confidence level.

MLR for prediction

Multiple regression seeks to determine the equation that best predicts the dependent variable Y as a linear function of a set of independent variables X where the linear relationship is expressed as:

$$ \hat{Y}=a+{b}_1{X}_1+{b}_2{X}_2+{b}_3{X}_3\dots .+\epsilon $$
(2)

where \( \hat{Y} \) is the estimated value of Y for a given set of X values; a is the intercept; b1 is the estimated slope (partial regression coefficient) of a regression of Y on X1, considering the rest of X variables to be held constant, likewise for b2, b3, and so on; and ϵ is the error term [16]. In this study, the coefficient of multiple determination (R2) is considered in the goodness-of-fit of a linear model for cross validation of models determined by the following equation as:

$$ {R}^2=\frac{\sum_{i=1}^n{\left({\hat{y}}_i-\overset{\hbox{'}}{y}\right)}^2}{\sum_{i=1}^n{\left({y}_i-\overset{\hbox{'}}{y}\right)}^2} $$
(3)

where \( {\hat{y}}_i \) as the predicted or fitted values of the dependent variable, \( \overset{\hbox{'}}{y} \) is the mean of the observed values of the dependent variable, yi as the observed values of the dependent variable, and n is the number of observed values of the data set.

Relative importance of environmental predictors

Relative importance is the “quantification of an individual regressor’s contribution to a multiple regression model” [35]. In this study, relative importance metrics are forced to percentages or summed to 100%. Gromping [35] calls this “lmg” as proposed by Lindeman et al. [36]. This is achieved using package “relaimpo” in R by the following code: calc.relimp (linmod, type = c(“lmg”), rela = TRUE).

Geostatistical analyses

Geostatistical analyses in this study were done making use of SAGA GIS 7 adapting the spatial and geostatistics-kriging module for grid interpolation from irregular sample points with variogram dialog. DSM is one of the many useful applications in SAGA GIS for soil science [34]. Variogram models were first generated before kriging techniques for the predictive models of soil fractions.

Semivariogram

Kriging is based on the idea of spatial correlation or dependence that the degree of influence of observed points to unobserved points is inversely related to distance [20, 37, 38]. The first essential step in kriging is the calculation of empirical variogram before any geostatistical interpolation. In this study, isotropic semivariogram is determined considering only on the effect of h [39]. A semivariogram model estimates the relationship between semivariances or the differences between neighboring values and separation distance [40], such as the spherical or exponential models. The empirical semivariogram formula is as follows:

$$ \gamma (h)=\frac{1}{2N(h)}{\sum}_{i=1}^{N(h)}{\left[Z\left({x}_i+h\right)-Z\left({x}_i\right)\right]}^2 $$
(4)

where γ(h) is the semivariance of the lag distance between two sample points; N(h) is the number of observation pairs of random variables separated by distance h; and Z(xi) and Z(xi + h) are the values of variables located at xi and xi + h, respectively [39, 41, 42]. The empirical semivariogram for each soil texture fraction was fitted with theoretical semivariogram models in SAGA GIS using its variogram module. In this study, the spherical, exponential and Gaussian models were fitted into the empirical variograms for the dependent soil fraction variables (clay, sand, silt).

Spherical:

$$ \gamma (h)=0\ \mathrm{if}\ h=0 $$
(5)
$$ \gamma (h)=n+\left(s-n\right)\left[\frac{3h}{2r}-\frac{1}{2}{\left(\frac{h}{r}\right)}^3\right]\mathrm{if}\ 0<h<r $$
(6)
$$ \gamma (h)=n+\left(s-n\right)\ \mathrm{if}\ h>r $$
(7)

Exponential:

$$ \gamma (h)=n+\left(s-n\right)\left[1-\mathit{\exp}\left(-\frac{h}{r}\right)\right] $$
(8)

Gaussian:

$$ \gamma (h)=n+\left(s-n\right)\left[1-\mathit{\exp}\left(-{\left(\frac{h}{r}\right)}^2\right)\right] $$
(9)

where n is the nugget effect at zero distance h, s is the sill, and r is the range [39, 43,44,45].

SK

Generally, weighted average of observed data points is an approach of spatial interpolation that is expressed as:

$$ \gamma (h)=\frac{1}{2N(h)}{\sum \limits}_{i=1}^{N(h)}{\left[Z\left({x}_i+h\right)-Z\left({x}_i\right)\right]}^2 $$
(10)

The empirical semivariogram for each soil texture fraction was fitted with theoretical semivariogram models in SAGA GIS using its variogram module. In this study, the spherical, exponential and Gaussian models were fitted into the empirical variograms for the dependent soil fraction variables (clay, sand, silt). The approximation of SK is based on the equation:

$$ {\hat{z}}_{sk}\left({x}_o\right)={\sum \limits}_{i=1}^n{\lambda}_iz\left({x}_i\right)+\left[1-{\sum \limits}_{i=1}^n{\lambda}_i\right]\mu $$
(11)

where μ is a given stationary mean of the observed values that is assumed to remain constant throughout the domain, λi is the kriging weight assigned at sampled locations [31, 46, 47] wherein the kriging weights are estimated by minimizing the variance [46].

OK

The OK is analogous to SK but OK involves that the summation of weights equals to one, such that \( \left[1-{\sum}_{i=1}^n{\lambda}_i\right]=0 \), with an accompanying Lagrange parameter ψ. Additionally, μ is not constant but it is recalculated within the search window across the modeled area of interest [46, 47]. Eventually, OK is estimated as:

$$ {\hat{z}}_{ok}\left({x}_o\right)={\sum \limits}_{i=1}^n{\lambda}_iz\left({x}_i\right) $$
(12)

In order to obtain unbiased estimations for OK, the following equations are solved simultaneously:

$$ {\sum \limits}_{j=1}^n{\lambda}_j\gamma \left({x}_i,{x}_j\right)+\psi =\gamma \left({x}_i,{x}_0\right) $$
(13)
$$ {\sum \limits}_{j=1}^n{\lambda}_j=1 $$
(14)

where γ(xi, xj) is the value of the variogram between two points xi and xj, ψ and is the Lagrange parameter for the minimization of kriging variance [31, 37, 41].

UK

OK and SK assume stationary process considering the mean. However, in other cases, spatial data are not stationary in the mean [31, 48] which is being considered in UK. Geostatistical literatures have used the terms UK, kriging with external drift, and regression kriging that are essentially adapting similar techniques [40, 49]. In UK, prediction at unvisited location is estimated by combining the predicted drift and residuals as:

$$ \hat{z}\left({x}_0\right)=u\left({x}_0\right)+\hat{e}\left({x}_0\right) $$
(15)

where drift or trend u(x0), a linear regression mean, is fitted by linear regression analysis, and \( \hat{e}\left({x}_0\right) \) is the interpolated residuals [40, 44, 49]. The trend can be expressed as a functional form:

$$ u(x)={\sum}_{k=0}^K{\beta}_k{f}_k(x) $$
(16)

where βk, k = 0, 1, …, K are unknown coefficients while fk(x) are known functions of x [31, 48]. In this study, the best model resulted from MLR analyses for each soil fraction was applied and predictions at unobserved locations are estimated by the following equation:

$$ {\hat{z}}_{uk}\left({x}_o\right)={\sum}_{i=1}^n{\lambda}_i\left({\sum}_{k=0}^K{\beta}_k{f}_k\left({x}_i\right)+\hat{e}\left({x}_i\right)\right) $$
(17)

Cross validation

Cross validation technique was adopted for evaluating and comparing the performance of models. The models were evaluated applying k-fold cross validation due to the absence of validation data set and small size of observed samples for model validation [50, 51]. In this study, 5-fold cross validation was adapted. There is no strict rule on the choice of k, but 5 or 10 is usually applied [51]. Cross validation of models was done in R applying caret package. The mean absolute error (MAE), root mean squared error (RMSE), and R2 were determined to evaluate the accuracy of models [8, 52, 53].

$$ MAE=\frac{1}{n}{\sum}_{i=n}^1\left|{y}_i-{\hat{y}}_i\right| $$
(18)
$$ RMSE=\sqrt{\frac{\sum_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2}{n}} $$
(19)

where yi is observed value, \( {\hat{y}}_i \) is the predicted value, and n is the number of samples.

Additionally, MLR models require homoscedasticity and normality of residuals [6, 54]. Thus, a scatter plot of residuals and predicted values were visually evaluated for homoscedasticity using plot function where a horizontal line at zero was added using abline function in R. Additionally, its normality test (skewness, kurtosis, Lilliefors test) along with its quantile-quantile (QQ) plot was determined in R applying plot diagnostics of plot function. Hence, homoscedasticity and normality of residuals were critical validation criteria as being meet by each model in this study.

Results and discussion

Exploratory data analyses

In this study, geostatistical analysis of data started with summary statistics, data transformation (Table 2), MLR modeling, variography, kriging, and mapping. The exploratory analysis revealed no outlier in the soil texture fractions (clay, sand, silt) and environmental variables. Thus, all 177 samples were used for the spatial interpolation. A statistical summary of the untransformed environmental variables (Table 1) shows high coefficient of variation between −3609 to 224.7%. Skewness and kurtosis values vary greatly between −3.85 to 7.20 and 1.77 to 70.7, respectively. Indicating that the variables are highly skewed with heavy tails, thus, do not fit the normality assumption. The Tukey’s ladder of powers transformation was applied in transforming the data that resulted in normal or nearly normal distribution having skewness values close to zero and kurtosis values have thinner tails (Table 2). Consequently, all data analyses from MLR modeling, variography, kriging, and cross validations of models were implemented using the transformed data.

Table 2 Descriptive statistics of transformed variables and the corresponding transformation

MLR modeling

The MLR analyzed the relationships between the transformed topsoil texture fractions and the transformed environmental variables. Table 3 shows the MLR results where forward selection, backward elimination, and stepwise selection methods were applied and cross-validated. At this stage, MLR established correlation between transformed dependent and independent variables. The dependent variables were the transformed values of soil texture fractions and the independent variables were the transformed values of environmental predictors. All predictors selection methods in MLR, in this study, revealed Lilliefors test values greater than 0.05. These were indications that the transformation of values both for dependent and independent variables resulted in an improvement of normality and variance stability [19], minimum error and unbiased estimate [20]. The predictors of selection methods with most homogeneity of variance (homoscedasticity) in terms of residuals were adapted as predictor grids, respectively, for clay, sand and silt fractions for UK as implemented in SAGA. The cross validation of MLR models are shown in Table 3 and their relative importance will be subsequently discussed.

Table 3 Cross validation of models by MLR applying different predictors selection methods

Cross validation of MLR models

Prior to kriging, evaluation of regression models was implemented. Table 3 shows the cross validation of models applying different regression selection methods and the normality statistics of residuals. The MAE, RMSE, homoscedasticity and normality (skewness, kurtosis and Lilliefors test) test of residuals were considered in assessing the accuracy performance of linear predictive models. Homoscedasticity and normality test of residuals were applied in order to verify and satisfy the assumption of homogeneity of variance [54] towards choosing the best linear unbiased estimator. All results of the regression models were coupled with kriging for the implementation of UK to further evaluate their performance and be able to determine the best unbiased models. Stepwise selection method has the advantage of having the least number of significant predictors, while forward selection method had the lesser number of significant predictor compared to backward elimination method.

Based on error metrics, performance of forward and stepwise selection methods for clay fraction were the same because they revealed the same significant predictors; this was also observed for sand fraction. Backward elimination method had the least MAE and RMSE for each of the soil texture fractions, and, consequently, it has the highest R2 (Table 3). Figure 2 shows scatter plots of residuals and predicted values for homoscedasticity examination. As can be observed in any of the residuals plots, homoscedasticity was not violated by any of the regression models applying forward selection, backward elimination, and stepwise selection methods for clay, sand, and silt texture fractions as linearity (red lines) assumption was met having no obvious pattern wherein the residuals scattered randomly and symmetrically clustering towards the middle of the plot. For a graphic examination of the normality of residuals, Fig. 3 shows QQ-plots of residuals of the MLR models applying forward selection, backward elimination, and stepwise selection methods. Graphically, different selection methods revealed similar distribution of residuals where, generally noticeable, most of the points fall closely either below or above the reference line (red), and no obvious presence of outliers. The regression models for sand fraction have the most points that lie very close to the reference (red) line. This was also statistically confirmed in cross validation statistics (Table 3) where sand fraction regression models have the least skewness and kurtosis values. Nevertheless, none of the residuals distribution deviates greatly from normality and that residuals of all regression models are statistically normally distributed based on Lilliefors normality test (Table 3).

Fig. 2
figure 2

Residuals plots of multiple linear regression models applying forward (F) selection, backward (B) elimination, and stepwise (S) selection methods

Fig. 3
figure 3

QQ plots of residuals of the multiple linear regression models applying forward (F) selection, backward (B) elimination, and stepwise (S) selection methods

Relative importance of predictors

Figure 4 shows that stream proximity, elevation, mountain ridge proximity, channel network base level, flow accumulation, band 4, band 11, catchment area, vertical overland flow distance, overland flow distance, slope, and curvature classification were the significant predictors in this study.

Fig. 4
figure 4

Relative importance of predictors

Stream proximity was found to be the most important predictor in this study. Pahlavan-Rad and Akbarimoghaddam [8] found stream proximity or distance from the river as the most important variable in the spatial variability of soil texture fractions in a flood plain. For clay fraction, elevation, mountain ridge proximity, and channel network base level were about equally second most important predictors. Likewise, slope and stream proximity were about equally important for silt. Mountain ridge proximity or distance to the mountain ridge is a predictor probably first introduced by this study for DSM for soil texture fractions prediction. Mountain ridgeline is a peak elevation line that delineates watersheds at opposite sides of a mountain (leeward and wayward). The results suggest that stream proximity, mountain ridge proximity, and slope are the most important predictors that influence topsoil texture spatial variation, particularly for clay and sand fractions, in the study area. Remotely sensed data, such as Landsat 8, where band 4 and band 11 were found important, pointed out that optical imagery are potential auxiliary variables in enhancing prediction of topsoil texture fractions [37]. Flow accumulation, as in the study conducted by Castro-Franco et al. [22], was also the least important in the spatial variability of silt fraction at the southern Argentine Pampas.

Geospatial interpolation

After knowing significant predictors for each of the soil texture fractions (clay, sand, and silt), predictions of soil texture fractions were further evaluated applying kriging techniques. Likewise, linear regression was enhanced by combining regression with kriging. Prior to kriging, for every soil fraction, the experimental semivariogram was determined to examine the spatial dependence of dependent variables within the observed data points [6]. As a variogram model was required for geo-statistical interpolation, the variogram models that related to the best results in cross validation were the best fitted variogram models correspondingly for clay, sand, and silt. Hence, predefined models such as exponential, Gaussian, and spherical were fitted to the empirical semivariograms (Table 4 and Fig. 5). The mentioned variography was done in SAGA applying its variogram module.

Table 4 The variogram parameters of soil texture fractions applying exponential, Gaussian and exponential models
Fig. 5
figure 5

Best fitted variogram models (red) for clay, sand, and silt

Semivariogram

The nugget to sill (N:S, C0/[C0 + C]) ratio or its complementary structured part to sill ratio is considered as a spatial dependency criterion of a semivariogram [6]. The smaller the N:S ratio or the higher the structured part to sill ratio, the stronger the spatial dependency or autocorrelation [6, 55]. The N:S ratios for clay, sand, and silt varies within 0.11–0.22, 0.14–0.24, and 0.08–0.17, respectively, for the three variogram models (Table 4). These indicated that topsoil texture fractions have a strong spatial dependency, especially for silt, considering that N:S ratios are lesser than 0.25 [6, 55, 56]. Nugget effects of exponential and spherical models were lesser than the Gaussian model. Effective range of exponential, Gaussian, and spherical were ascending in order. Effective ranges of exponential and Gaussian models do not differ greatly, while effective range of spherical model is more or less twice the Gaussian model.

Performance of kriging techniques

The variogram parameters, as shown in Table 4, were adapted for OK, SK, and UK. All linear regression models considered in this study were coupled with UK. In the same manner, the performance of OK, SK, and UK applying exponential, Gausian, and spherical variaogram models were cross validated in terms of MAE, RMSE, and R2 separately for clay, sand, and silt texture fractions (Tables 5, 6 and 7). Since forward and stepwise selection methods revealed the same significant predictors for clay and, likewise, for sand for the MLR (Table 3), UK-Stepwise in Tables 5 and 6 also represent UK-Forward for clay and sand, respectively.

Table 5 Cross validation of OK, SK, and UK applying exponential, Gaussian, and spherical variogram models applied to Clay texture fraction
Table 6 Cross validation of OK, SK, and UK applying exponential, Gaussian, and spherical variogram models applied to Sand texture fraction
Table 7 Cross validation of OK, SK, and UK applying exponential, Gaussian, and spherical variogram models applied to Silt texture fraction

Generally, kriging techniques (Tables 5, 6 and 7) have better prediction accuracy than the linear regression models (Table 3) in terms of MAE, RMSE, and R2. The performance of kriging techniques fitted with exponential or spherical variogram models was comparable, while kriging techniques fitted with Gaussian models were likewise at par (Tables 5, 6 and 7). As observed, kriging techniques fitted with Gaussian models have low prediction accuracy compared to kriging techniques fitted with exponential or spherical variogram models. Comparing MLR with kriging techniques, the respective MAE of clayMLR, sandMLR, and siltMLR were about 2.5, 2, and 2.8 times as high compared to kriging techniques fitted with exponential or spherical variogram models. Similarly, MAEs of clayMLR, sandMLR, and siltMLR were about 1.4, 1.2, and 1.6 times as high compared to kriging techniques fitted with Gaussian variogram model. In other words, R2 values of kriging techniques with exponential or spherical variogram model were about 3, 3.8, and 3.8 times as high compared to clayMLR, sandMLR, and siltMLR, respectively, while, R2 results of kriging techniques with Gaussian variogram model were about 2.1, 2, and 3.1 times as high compared to clayMLR, sandMLR, and siltMLR. Kriging technique with Gaussian variogram model had the least performance among the different kriging techniques applied in this study.

For clay texture fraction, UK with predictors by backward elimination method and exponential variogram model (ClayUK-back-expo) revealed the lowest MAE and RMSE of 0.126 and 0.178, respectively. Consequently, Clay-UK-back-expo exhibited the highest R2 of 0.878 among the models presented in (Table 5). Moreover, graphical evaluation of residuals revealed significant improvement from MLR to UK. Residuals by UK reduced significantly closer towards the red line (Fig. 6). Thus, ClayUK-back-expo was considered the final model for transformed clay texture fraction as shown in Fig. 7, while Fig. 8 represents the final model of clay after back transformation.

Fig. 6
figure 6

Homoscedasticity and improvement of residuals from multiple linear regression (MLR) with backward selection method to universal kriging (UK) with exponential variogram applied to Clay, Sand, and Silt texture fractions

Fig. 7
figure 7

Final Model of Transformed Clay by UK with predictors by backward elimination method and exponential variogram

Fig. 8
figure 8

Final Model of Clay after back transformation

Likewise, for sand texture fraction, UK with predictors by backward elimination method and exponential variogram model (SandUK-back-expo) revealed the lowest MAE and RMSE of 0.138 and 0.183, respectively. Consequently, SandUK-back-expo exhibited the highest R2 of 0.821 among the models presented in (Table 6). Furthermore, graphical visualization of residuals has shown significant improvement from MLR to UK. Residuals by UK reduced significantly closer towards the red line (Fig. 6). Thus, SandUK-back-expo was considered the final model for sand texture fraction as shown in Fig. 9, while Fig. 10 represents the final model of sand after back transformation.

Fig. 9
figure 9

Final Model of Transformed Sand by UK with predictors by backward elimination method and exponential variogram

Fig. 10
figure 10

Final Model of Sand after back transformation

Similarly, for silt texture fraction, UK with predictors by backward elimination method and exponential variogram model (SiltUK-back-expo) revealed the lowest MAE and RMSE of 0.141 and 0.221, respectively. Consequently, SiltUK-back-expo exhibited the highest R2 of 0.8929 among the models presented in (Table 7). Moreover, graphical evaluation of residuals revealed significant improvement from MLR to UK. Residuals by UK reduced significantly closer towards the red line (Fig. 6). Thus, SiltUK-back-expo was considered the final model for silt texture fraction as shown in Fig. 11, while Fig. 12 represents the final model of silt after back transformation.

Fig. 11
figure 11

Final Model of Transformed Silt by UK with predictors by backward elimination method and exponential variogram

Fig. 12
figure 12

Final model of Silt after back transformation

For the three soil texture fractions of clay, sand, and silt, UK with predictors by backward elimination method and exponential variogram model performed the highest predictive accuracy. This is an indication that the datasets are characterized by a linear trend, thus, UK is recommended to attain more fitting mapping results [44]. Nevertheless, performance of UK with predictors by stepwise selection method and exponential variogram model for clay, sand, and silt texture fractions was at par as shown from Tables 5, 6 and 7. Thus, UK can be considered a hybridized geospatial interpolation technique that coupled significant predictors from regression model and variography of regression residuals [57, 58]. However, in general, performance of OK and SK was comparable to UK (Tables 5, 6 and 7). Hence, for a straightforward and immediate estimate, OK and SK can be adapted [49]. Furthermore, it is observed that the stronger the spatial dependency or autocorrelation, the higher the R2. Hence, high R2 values of OK, SK, and UK were strongly attributed to high spatial dependency or autocorrelation.

Conclusions

Comprehending geographical distribution and accurate predictive mapping of topsoil texture fractions at a municipal scale are essential for watershed management, soil and water conservation, hydrological and crop suitability modeling. This fundamental soil texture geo-information processing approach is among the vital steps towards attaining a comprehensive land use plan. Spatial and geostatistics modeling, specifically MLR and kriging techniques (OK, SK, and UK), were fitted to predict topsoil texture fractions particularly for clay, sand, and silt. Data from sink-filled DEM derived environmental variables combined with remotely sensed images of Landsat 8 generated satisfactory results. Mountain ridge proximity, not considered among the predictors in DSM before but now, was found to be a significant predictor in this study. Further investigation should be conducted on the other side of the mountain ridge to understand better its predictive importance. Geospatial interpolations by kriging techniques were more accurate than MLR. The OK, SK, and UK fitted with Gaussian variogram model were less suitable for predicting soil texture fractions whereas, UK coupled with predictors by backward elimination method and fitted with exponential variogram portrayed the most accurate in predicting topsoil texture fractions for clay, sand, and silt compared to any of the spatial and geostatistical modeling applied in this study. The results show that the transformation of dependent and independent variables provides homoscedasticity and normality of residuals to ensure unbiased estimation. With the significant cost advantage of this study on the use of FOSS, the models obtained by DSM revealed several benefits particularly in storing, retrieving and reproducing reliable geospatial information and ease of updating data and analysis. This study presents a successful approach in DSM with the use of SAGA GIS, QGIS, and R that are of best cost advantage to be adapted by any GIS users from a developing country like the Philippines. Moreover, additional soil samples and field validations of results are recommended for future research with the aim of improving accuracy and reliability of the models. Likewise, application of DSM should be extended in other municipalities throughout the central Visayas region and the rest of the country. The achieved methodology in this study is of great importance to decision makers at municipal, provincial, and regional scale for a very valuable outcome in achieving a more comprehensive land use plan since the generated results are useful for watershed management particularly for ecological, hydrological, and crop suitability modeling.