1 Introduction

Soil salinization is the salts accumulation process in the subsurface that destroys soil composition and water quality, reduces agricultural productivity, and negatively influences economic growth [1]. As one of the most serious land degradation forms, salinization occurs in arid and semiarid zones, where evaporation exceeds precipitation [2]. The Food and Agriculture Organization (FAO) [3] estimated the salt-affected surface area to be 831 million ha in total, including 434 million ha of sodic soils and 397 million ha of saline soils. In Hungary, salinity and sodicity affect one-third of the Great Hungarian Plain soils, and the potential salt-affected soils (SAS) cover one-third of the territory [4]. Drought, climate change, low water resources, and land-use changes can aggravate salinity conditions [5]. Therefore, authorities tend to adopt monitoring strategies to control this process continuously. Control and prevention approaches on the regional scale are time-consuming and require various resources. As a result, the geographic information systems (GIS) and remote sensing techniques opened the window for technology to replace or assist conventional methods. They established easier, less time-consuming, and inexpensive techniques to assess and predict environmental processes such as salinization. Using spectral response extracted directly from sensors or after the application of spectral transformations, i.e., principal component analysis (PCA) [6, 7], tasseled cap transformation [8], and spectral indices [9, 10], have given promising results in terms of prediction accuracy. Many scholars emphasize the importance of spectral signature in remote sensing studies as a core concept in this field [11,12,13,14]. Digital soil mapping has been the focus of several research projects operated with different types of remotely sensed data, using statistical or geostatistical methods. El Hafyani et al. [15] demonstrated Landsat 8 OLI image's efficacity to model soil salinity in Morocco's Tafilalet plain. The results yielded a coefficient of determination R2 ranges from 0.53 to 0.75, and a root-mean-square error (RMSE) ranges from 0.62 to 0.80 dS/m. Hihi et al. [16] found a medium correlation (R2 = 0.48) between electrical conductivity (EC) and spectral indices derived from a Sentinel 2 MSI image using a simple linear regression model. In Uzbekistan, using variance analysis, Ivushkin et al. [17] found a significant correlation between soil salinity and canopy temperature derived from the moderate resolution imaging spectroradiometer (MODIS) data. The Gaussian processes method applied on SAR Sentinel-1 data and advanced machine learning algorithms produced a highly accurate model that explains the relationship between remotely sensed data and EC, with a coefficient of determination R2 equal to 0.808 [18]. Taghadosi et al. [19] revealed the efficiency of intensity images derived from VV and VH polarization of SAR Sentinel-1 imagery in the discrimination of saline soils. The support vector regression (SVR) technique with radial basis function (RBF) kernel produced the most accurate model, with a coefficient of determination R2 equals 0.9783 and an RMSE equals 0.3561. In Algeria, Dehni et al. [10] proved the importance of Landsat ETM+ and ALI-EO-1 images in identifying and delineating saline and sodic soils. Zurqani et al. [20] used a time series of Landsat data for 29 years (1972, 1987, and 2001) and field measurements to detect the spatiotemporal variation of soil salinity in Libya. Weng et al. [21] suggested using hyperspectral imagery to produce a higher accuracy. The univariate regression using a new salinity index (SSI) derived from EO-1 Hyperion data yielded a coefficient of determination R2 equal to 0.873. Sahbeni G. [22] applied a multiple linear regression analysis between salt content and Sentinel 2 MSI data for modeling soil salinity distribution in the Great Hungarian Plain. The final model showed a significant moderate accuracy with a coefficient of determination R2 equal to 0.51 and an RMSE equal to 0.1942 g/kg. Li Z. et al. [23] analyzed the performance of multiple linear regression (MLR), geographically weighted regression (GWR), and random forest regression (RFR) models in terms of soil salinity prediction. The findings revealed that remote sensing imagery scanned during the dry season is better for estimating electrical conductivity (EC), with topographic variables and spatial location playing a crucial part in the modeling. However, the random forest regression (RFR) model had a higher prediction accuracy than the GWR model, while the MLR model had the highest error. Al-Ali Z.M. et al. [24] studied eight different physical models for soil salinity mapping in an arid landscape using spectral reflectance measurements and Landsat 8 OLI data. The statistical tests for models based on visible-near-infrared (VNIR) bands showed insignificant fits with a coefficient of determination R2 of 0.41 and a very high RMSE (≥ 0.65). In contrast, models based on the second-order polynomial equation using the shortwave infrared (SWIR) bands produced better results, with a coefficient of determination R2 equal to 0.97 and an RMSE equal to 0.13. The study conducted by Zhang X. and Huang B. [25] to predict soil salinity using soil-reflected spectra showed that smoothing methods and spectral transformations influenced the models' estimation accuracies. The PCR based on the median filtering data smoothing method gave the most accurate results, with a coefficient of determination R2 equal to 0.7206 and an RMSE equal to 0.3929.

The studies mentioned above have explored the potential efficiency of remotely sensed data in estimating salinity when coupled with field measurements. The electromagnetic signal sensitivity to surface soil parameters can be processed to map salinity distribution with great accuracy. Nevertheless, most of these reviewed studies used statistical analysis and satellite imagery to model soil salinity in the Afro-Eurasia region. Only a few studies focused on the Central-Eastern European zone, although Hungary has the largest expanse of naturally salt-affected soils in the continent.

In flat landscapes with a typical continental climate like the Great Hungarian Plain, soluble salts accumulate more quickly. Certain environmental and climatic conditions make the landscape threatened by salinization, mainly when groundwater levels are oscillating. As a result, modeling soil salinity variability and mapping its spatial pattern for the Great Hungarian Palin become an essential task to endorse successful soil reclamation programs that mitigate possible fluctuations in salinity levels. The aims of this study are to; (i) predict salinity using remotely sensed data derived from Landsat 8 OLI sensor and field measurements through evaluating four regression methods; Multiple Linear Regression (MLR), Partial Least Squares Regression (PLSR), Ridge Regression (RR), and Artificial Neural Network (ANN), and (ii) prove the importance of spectral enhancements in soil salinity modeling.

2 Study area

The study area is located in the Great Hungarian Plain (GHP), east of Hungary (Fig. 1). It lies between longitudes 20°22'24" and 21°41'51", and between latitudes 47°5'6" and 47°48'40", covering an area of 7602.84 km2. The average elevation is 88.8 m above sea level. Half of the Great Hungarian Plain (GHP) is flat with topographic level differences of 3–4 m. Wind-blown sand on the hills, loess and loess-like sediments above the level of the actual floodplains, and silty clay in flat alluvial areas are the three deposit types that can be found on its surface [26, 27]. The main river is the Tisza. It flows through the plain and collects tributaries from almost the entire lowland area. The study area climate is semiarid, with an average annual temperature of 11 °C [28]. The most precipitation occurs between May and July, while the least precipitation falls between January and March. The annual mean precipitation is less than 500 mm in the Tisza river's low altitude valley [29].

Fig.1
figure 1

Location of the study area; a Satellite imagery map of Hungary (ESRI Basemap, ArcMap 10.3), b Sampling sites' locations

3 Materials and methods

3.1 Remotely sensed data

We downloaded a Landsat 8 OLI image from the United States Geological Survey (USGS). It was acquired on the 08 of August, 2016. The difference between the imagery acquisition day and the field measurements date is due to the scarcity of cloudless accessible data (Cloud Cover < 5%). Table 1, extracted from the metadata file, illustrates the properties of remotely sensed data used.

Table 1 Properties of Landsat 8 OLI image used

SRTM GL1 (30 m Ellipsoidal) is a version of the Shuttle Radar Topography Mission dataset that uses elevation values from WGS84 ellipsoidal height instead of the normal orthometric or geoid-referenced elevation. The EGM96 geoid model was subtracted from the standard SRTM GL1 data to create this dataset. It has a resolution of 1 arc-second (30 m) [30]. We downloaded an SRTM elevation model in GeoTIFF format. The OpenTopography Facility provided it with support from the National Science Foundation [31].

3.2 Soil sampling

The Research Institute for Soil Sciences and Agricultural Chemistry (RISSAC) provided the field measurements database on the 16th of March 2020. Soil samples are collected within the subsoil's upper layer (30 cm) from mid-September until mid-October 2016 [32, 33]. The field campaign is carried out during the dry season to improve salt spectral properties at the surface during salt accumulation. Electrical conductivity is measured in the laboratory by immersing an electrode in a water-saturated soil paste at the plasticity limit [34]. Soil salinity is determined using saturated paste according to the Hungarian Standard MSZ-08-0206/2-1978 [35, 36]. MSZ-08-0206/2-1978 protocol details can be found in (MSZ 1978, 1978) [37].

3.3 Preprocessing

We converted Landsat 8 OLI image to Top of Atmosphere (TOA) Reflectance [54]. Fast line-of-sight atmospheric analysis of spectral hypercubes (FLAASH) was applied using ENVI 5.3 software package for atmospheric correction with standard options [21, 38, 39].

3.4 Spectral indices

Based on the literature review [40,41,42,43,44,45,46], we computed eighteen spectral indices from preprocessed data using ENVI 5.3 and ArcMap 10.3. These indices were selected for this study due to their potential association with salinity detection. Table 2 shows the mathematical expressions of spectral indices.

Table 2 Spectral indices used in regression analyses

The information in intercorrelated variables is condensed into a few variables called principal components. Principal component analysis (PCA) application removes spectral noise and reduces information redundancy. Therefore, we used PCA to compress information from visible, near-infrared, and SWIR bands [6, 7]. Based on the statistics file produced by ENVI 5.3, the first two components explain approximately 99% of data variance. They contain the majority of information that can be useful for soil salinity modeling (Fig. 2).

Fig.2
figure 2

Eigenvalues and variance accounted by principal components

The effect of topography on salts movement within the profile is revealed by many scholars, mainly where the water balance is negative [57, 58]. Therefore, we included elevation as an independent covariate in the regression modeling to evaluate this component's influence on salinity's spatial distribution.

After extracting spectral indices, bands, PCA1, PCA2, and elevation values corresponding to sampling sites, we performed regression analyses between salt content measurements and spectral data. The goal is to understand the statistical relationship between remotely sensed data and the sampled soil's salt content. Figure 3 illustrates the study's methodology.

Fig. 3
figure 3

Schematic diagram of the study methodology

3.5 Regression modeling

On systems with limited computing resources, linear regression is a simple algorithm that can be trained quickly and efficiently. It has a significantly lower time complexity than other machine learning algorithms [59,60,61].

Due to their robustness to collinearity, which is frequent between multiple variables, ridge regression and partial least squares regression approaches are commonly used [62, 63]. The artificial neural network is distinguished by the efficiency of handling highly complex data, along with the potential to generalize, as discussed by many scholars [46, 64, 65]. This study evaluates the statistical methods cited above. Details are explained and well-presented by Bishop C.M. [66].

Multiple linear regression is a linear approach to modeling the relationship between one or more explanatory variables and a scalar variable [16, 50]. Equation (1) describes the relationship between the dependent and explanatory variables.

$$Y = A_{0} + A_{1} * x_{1} + A_{2} * x_{2} + A_{3} * x_{3} + \cdots A_{n} * x_{n}$$
(1)

where Y is the dependent variable, xi is the explanatory variable, Ai is the coefficient of the variable i, and A0 is the intercept.

By adding a bias to the regression estimates, the ridge regression overcomes the data collinearity problem. The penalization factor or lambda regularizes the coefficients to penalize the optimization function if the coefficients take large values [62, 67]. The ridge estimation of coefficients in a linear regression model y = β(k)*x + b is described in Equation (2).

$$\hat{\beta }\left( k \right) = \left( {X\prime X + kI} \right) - 1 X\prime Y{ }$$
(2)

where X is an n-by-p independent variable matrix, X' is its transpose, I is an n-by-n identity matrix, Y is an n-by-1 vector of observations, and k is the ridge parameter.

The Hoel–Kennad formula in Equation (3) can be used to calculate the ridge parameter.

$$k = \hat{\sigma }^{2} /{\text{max}} \left( {\hat{\alpha } \, i 2} \right)$$
(3)

where σ̂ 2 is the estimation of residual squares, and σ̂ is defined as σ̂ = Λ−1G'X'Y, where G is an orthogonal square matrix, and Λ is a diagonal matrix. Λ and G are associated with this equation: G'(X'X)G = Λ. [62]

The partial least squares regression (PLSR) reduces the predictors to a smaller collection of uncorrelated components before performing least squares regression on these components. This approach is advantageous when the predictors are strongly collinear. The predictors are measured with error, making PLSR more robust to uncertainty measurement [63, 68].

$$y = \mathop \sum \limits_{1}^{n} w_{{\text{i}}} * \rho_{{\text{i}}} + w_{0}$$
(4)

where y is the dependent variable, ρ is the explanatory variable, w is the model coefficient for the ith variable, w0 is the constant item, and n is the number of variables.

The artificial neural network (ANN) is a problem-solving model inspired by the biological neuron system and consisted of many highly interconnected computing elements called neurons. It takes a nonlinear path and processes data in parallel through nodes, resulting in a complex adaptive system that can alter its internal structure by changing input weights [69]. In this study, we used a feedforward neural network with a single hidden layer. Its simplicity distinguishes it from other types of artificial neural networks, in which data moves through various input nodes before it reaches the output node [46, 70]. For a regression function y = f (x), a feedforward network defines a mapping y = f (x; θ) and learns the value of the parameters θ, resulting in the best function approximation, described in Eq. (5).

$$y = f\left( {\mathop \sum \limits_{1}^{n} x_{{\text{i}}} * w_{{\text{i}}} + b} \right)$$
(5)

where y is the dependent variable, x is the explanatory variable, n is the number of independent variables (i goes from 1 to n), w is the weight of the explanatory variable, and b is the bias added to the function. Figure 4 illustrates the neural network's basic concept.

Fig. 4
figure 4

Basic concept of artificial neural network [71]

We used seventy-six soil samples in this study, with 70% for training and 30% for validation. Three statistical metrics, i.e., root-mean-square error (RMSE), mean squared error (MSE), and min-max accuracy, are computed to assess regression models' performance.

Equations (6) and (7) describe the mathematical expressions of MSE and RMSE.

$${\text{RMSE}} = \sqrt {\mathop \sum \limits_{1}^{n} \left( {\hat{y}_{{\text{i}}} - y_{{\text{i}}} } \right)/n}$$
(6)
$${\text{MSE}} = \frac{1}{n}*\mathop \sum \limits_{1}^{n} \left( {y_{{\text{i}}} - \hat{y}_{{\text{i}}} } \right)$$
(7)

where ŷi is the predicted value for the ith observation, yi is the measured value for the ith observation, and n is the total number of observations.

4 Results and discussion

4.1 Descriptive analysis

The main statistical parameters for the salt content data are given in Table 3. The disparity between the minimum and the maximum reflects a spatial variability in salinity levels. The normality test revealed that the salt content data were positively biased. The contrast between the mean and the median has demonstrated this point.

Table 3 Descriptive statistics on salt content samples

We applied a square root transformation to increase the distribution of data to normality. This was approved by a remarkable improvement in skewness from 1.96 for the initial data to − 0.39 for the normalized data.

4.2 Models performance

A low RMSE value indicates an accurate prediction. Similarly, a low MSE value indicates a more precise estimation. We calculated the statistical parameters using Eqs. (6) and (7). More RMSE and MSE are closer to zero, higher is the model's accuracy. The statistical performance of the regression models is given in Table 4. Overall, the four regression models have a satisfactory goodness-of-fit for the training and test sets, based on Table 4 and Fig. 5. The MSE of the ridge regression (RR) model is the lowest for the test sample (= 0.007). The MLR model has the lowest MSE (= 0.004) for the training sample, followed by the RR and the PLSR models (MSE=0.006), and eventually the ANN model (MSE=0.029). The MLR model yielded the lowest RMSE value (= 0.004), followed by the PLSR model (= 0.006), the RR model (= 0.081), and finally, the ANN model (= 0.118). To evaluate the models' accuracy, we ran a min-max accuracy test, which takes the mean values of the minimum and maximum values between the measured and the predicted values. It has a range from zero to one. A min-max accuracy score equal to one indicates a perfect match between the actual and estimated values [72]. The RR model has the highest accuracy (= 0.75), followed by the PLSR model (= 0.73), the MLR (= 0.73), and the feedforward ANN model (= 0.71).

Table 4 Models' performance using three statistical parameters
Fig. 5
figure 5

Relationship between measured and predicted normalized salt content values, a MLR, b PLSR, c RR, d ANN

The objectives to be achieved determines the appropriate approach for a model's selection. Overall, the highest accuracy model would be favored, but very low RMSE and MSE must also be considered. Hence, the ridge regression (RR) model is regarded as the best model with the highest accuracy (= 75 %) and the lowest MSE for the test set.

4.3 Salinity prediction map

Based on three statistical metrics analyses, the ridge regression model was selected for predicting soil salinity in the Great Hungarian Plain. Therefore, we used Eq. (8).

$$\begin{gathered} {\text{SQRT}}\left( {{\text{SSC}} \% } \right) = 1.36582 + 0.00087* {\text{LST}} - 0.03777 * {\text{CRSI}} - 0.00103 * {\text{BI}} + 0.1195*{\text{ SI}} - \hfill \\ 0.01713 * {\text{SI}}_{1} - 0.05036 *{\text{ SI}}_{2} + 0.016 *{\text{SR }} + 0.0141 * {\text{NDSI }} - 0.25137 *\log \left( {{\text{elevation}}} \right) + \hfill \\ 0.00059 * {\text{VSSI}} + 0.02754 * {\text{RSI}} - 0.1545 * {\text{PCA}}_{1} + 0.03208 * {\text{PCA}}_{2} + 0.00106 * {\text{SI}}_{3} - \hfill \\ 0.00184 * {\text{SI}}_{5} - 0.0083 * {\text{DVI}} - 0.10293 * B_{7} \hfill \\ \end{gathered}$$
(8)

The maximum predicted value of normalized salt content equals 0.38. The minimum estimated value equals − 0.13. The mean value equals 0.21, and the standard deviation equals 0.03. Water bodies and urban infrastructures are the most common negative predictions. As a result, we used a masking tool to exclude these pixels from the landscape. According to Weng et al. [21], negative predictions are illogical and indicative of prediction errors. However, it can also be caused by residual noise that occurs after atmospheric corrections.

As data rescaling represents an essential step in data analysis, we converted the salt content variable from (%) to g/kg, giving it broader ranges for the classification (Table 5). The final map is presented in Fig. 6.

Table 5 Soil salinity content levels using Chinese Saline Soil Classification Standard presented by WENG et al. [21]
Fig.6
figure 6

Soil salinity prediction map using the final model

Figure 6 shows that around 99% of total classified pixels are non-saline and less than 1% are low saline. The maximum predicted value of salt content equals 1.46 g/kg. In contrast, the minimum estimated value equals zero. The mean equals 0.44 g/kg, and the standard deviation equals 0.11 g/kg.

A visual interpretation of the map showed a noticeable association between extremely non-saline soils and the elevation. Despite the study area's absolute flatness, the effect of microrelief in salinity distribution is relatively evident on the map. Comparing LST variation with soil salinity distribution, we found similarities between the two variables' spatial distribution. High LST values are frequent in soils with higher salinity, while lower LST values occupy firmly non-saline soils (0 g/kg). The correlation matrix derived from the band collection statistics test shows a strong negative correlation with elevation, and a moderately low positive correlation with LST, with a coefficient of determination R2 equal to − 0.86 and 0.28, respectively.

Overall, saline land features discrimination is an achievable task using spectral response values and spectral transformations derived from multispectral sensors. This agrees with Matinfar et al. [73] and  Wu et al. [74] results regarding spectral indices' efficiency in soil salinity mapping. Mousavi et al. [39] found a strong association between spectral indices and soil salinity distribution using artificial neural network (R2 = 0.964), which is relatively in agreement with our findings. Nevertheless, the ANN model performed the least among other regression models producing the lowest accuracy (= 71%) and the highest RMSE (RMSE= 0.118). The small sample size used to train the neural network limited the prediction performance of this method. Using a multiple linear regression model, Hihi et al. [16] found a medium correlation between spectral indices derived from Sentinel 2 MSI sensor and salt content in the Tunisian South, with a determination coefficient R2 equals 0.48. Allbed et al. [50] found a medium correlation with a coefficient of determination R2 equals 0.65 using an IKONOS image and regression techniques in Al-Hassa oasis, Saudi Arabia. Asfaw et al. [2] found a strong relationship between remotely sensed data and electrical conductivity (EC) in a regression model with a coefficient of determination R2 equals 0.78. These findings agree with our approach, elucidating statistical modeling's potential to map soil salinity with lower costs.

Furthermore, this work demonstrates the significance of shortwave infrared bands, and in particular the second SWIR band, in salinity prediction. This agrees with Hihi S et al. [16] and Lamqadem A et al. [75] studies. Bannari et al. [76] also revealed the importance of SWIR bands in soil salinity modeling using Sentinel 2 MSI image. Electrical conductivity in the laboratory (ECLab) showed a moderate association with SWIR bands with a coefficient of determination R2 equal to 0.5 for SWIR1 and 0.6 for SWIR2. The relationship is uncertain for sand patterns, moist soils, and vegetated areas with high salinity values and low reflectance in the near-infrared channel, where prediction errors are more likely to occur. Future studies will focus on using data derived from hyperspectral sensors to overcome this issue, as recommended by previous studies [21, 53, 77].

Despite models' accuracy differences from one study area to another, a significant relationship between remotely sensed data and soil salinity has been well-established in the literature. Moreover, several environmental factors influence soil salinity measurements and spectral information reception, explaining the findings disparity within previous studies.

Like many natural processes, soil salinization is influenced by environmental variations, making it difficult to track and analyze with absolute certainty. The prediction errors persist even with using the most robust statistical technique. Therefore, it is suggested to perform a validation sampling to evaluate the regression model's applicability and estimate its real predictive potential.

5 Conclusions

Soil salinization is a major environmental threat that affects soil and water quality, reduces agricultural production and food safety worldwide. As a result, collecting valuable information on salinity levels is a critical challenge for decision makers to improve and sustain land management strategies.

We discussed this issue by evaluating the usefulness of Landsat 8 OLI imagery for soil salinity prediction using four state-of-the-art regression modeling algorithms. Accordingly, the ridge regression approach performed the best in this research with a penalty parameter equal to 0.387 and a min-max accuracy equal to 75%. The model can be applied in another study area with similar climatic conditions and geological formations. As revealed by many scholars in the literature section, Landsat 8 OLI image has shown its relative efficiency in soil salinity estimation. This study demonstrates the importance of spectral enhancements in mapping environmental parameters such as soil salinity and identifying pattern variations. Hence, the elevation is proved to have a statistical significance for soil salinity prediction, with a p-value equal to 0.002. Throughout the combination of field measurements and spectral information, remote sensing represents an inexpensive and effective reliable tool to estimate soil salinity.

Future studies will focus on reducing spectral noise, improving the model's predictive power, and investigating the role of environmental factors in salts distribution and movement.