Introduction

Cadmium (Cd), a heavy metal, poses a substantial threat to agricultural productivity when present in soil. The increasing deposition of heavy metals in soil and the environment has been attributed to several factors, including the burning of fossil fuels, mining operations, non-standard use of pesticides and fertilizers, release of municipal effluents, transportation, and other industrial processes (Vaverková et al. 2019). These activities have contributed to ecological imbalances. Industrialization and intensive farming practices have accelerated the accumulation of heavy metals in agricultural regions, resulting in a range of soil-related problems including land desertification, soil pollution, reduction in soil fertility, soil erosion, and salinization (Nosrati and Collins 2019; Zhang et al. 2020). Heavy metal contamination in soil is particularly difficult to manage because of its prolonged persistence, irreversibility, restricted movement, elevated toxicity, and intricate chemical makeup (Tsai et al. 2019; Dhaliwal et al. 2020; Zhao et al. 2023). Non-point heavy metal contamination in agricultural soils is mostly attributed to the extensive use of fertilizers and pesticides in agricultural production (De Souza et al. 2016; Kurwadkar 2019; Dong et al. 2020), in particular phosphorus and biologically derived fertilizers that contain these metals (Chao et al. 2019). In addition, the misuse of various chemicals and the use of physiologically acidic fertilizers and pesticides have caused soil acidification in numerous regions (Wang et al. 2019a; Xu et al. 2020), which in turn increases the solubility and mobility of heavy metals.

Although new methods are being developed to reduce the heavy metal content of fertilizers, phosphorus fertilizers may still contain high amounts of heavy metals because of the composition of the phosphate rocks used in production (Nicholson et al. 2003; Lambert et al. 2007; Nziguheba and Smolder 2008; Kacar 2013). Superphosphates, double superphosphates, triple superphosphates, and phosphorus compound fertilizers produced using phosphoric and sulphuric acid or the wet combustion method pose a higher potential hazard (Kacar 2013). However, P fertilizers made with phosphoric acid produced through dry combustion contain negligible Cd (Kacar 2013; Lopez-Valdez and Fernandez-Luqueno 2014). Nitrogenous and potassium fertilizers have a relatively lower Cd content (Senesi and Polemio 1981; Molina et al. 2009; Jiao et al. 2012; Benson et al. 2014).

While traditional statistical methods can provide descriptive explanations, machine learning regression models can describe a data set more meaningfully. The high heterogeneity of agricultural soils reduces the prediction rates of traditional regression models (Padarian et al. 2019). Unlike traditional regression models, machine learning models offer solutions for multivariate and complex structures. Machine learning has been used to predict environmental processes and to assess pollutant toxicity from existing data (Gao et al. 2021; Yang et al. 2021). Cadmium poses a significant threat to ecosystem services, and competent tools such as machine learning are needed to model the behaviour of such threats in complex systems such as soils (Huang et al. 2023).

Previous studies have focused primarily on potential sources of Cd and other heavy metals in contaminated soils of urban areas (Yadav et al. 2019; Wang et al. 2019b; Vural et al. 2021), industrialized and mining areas (Koca 2019; Wanhong et al. 2020), agricultural lands (Chakaraborty et al. 2019; Ozkan and Uygur 2019; Taspinar et al. 2022; Bayraklı et al. 2023; Gunal et al. 2023), and urban parks and public playgrounds (Luo et al. 2019; Yesil and Yesil 2019). Although there is little research on machine learning modelling specifically for lead in soils (Huang et al. 2023), there are studies on modelling heavy metals (Hu et al. 2020; Zhang et al. 2020, 2021; Zhao et al. 2023). Progress in machine learning techniques has substantially improved the precision of forecasts. For instance, Zhang et al. (2020) examined As, Zn, Pb, Hg, Ni, Cu, and Cr in soil and found that support vector machine (SVM) and random forest (RF) models may effectively predict heavy metals in urban soil. Zhang et al. (2021) examined As, Zn, Pb, Hg, Ni, and Cu in soil and found that random forest models can effectively predict urban soil heavy metal levels. Hu et al. (2020) compared random forest, gradient-boosted machine, and generalized linear models. Zhao et al. (2023) employed ten machine learning methods to forecast the uptake of Cd, Hg, As, and Pb in rice grain; the extremely randomized tree model performed best in predicting rice-Cd and rice-Hg levels, whereas the random forest model yielded the best results for As and Pb. As a heavy metal, Cd presents substantial hazards and is associated with a wider spectrum of pollution caused by agricultural practices, mainly fertilization. Therefore, it is crucial to thoroughly examine its presence in soil systems in relation to routinely analysed soil properties. However, there is a shortage of research that employs machine learning approaches to forecast the concentration of available Cd in soil.

Most traditional statistical methods for predictive modelling rely on the assumption of a Gaussian relationship between the response variable and the predictors. Additionally, most methods assume that the impact of interactions between predictors on the response variable is consistent throughout the whole range of predictor values. Both assumptions are unjustified in the majority of circumstances. However, the use of a quadratic function to describe Gaussian responses in machine learning models has often resulted in strong prediction accuracy. Developments in computer technology and data science offer the opportunity to make strong predictions with machine learning methods for multivariate and non-homogeneous soil sites. Detailed, large-scale data on Cd, the most dynamic and mobile of the heavy metals, are still scarce for the farmland soils of the Konya plain, the primary wheat-producing region of Türkiye. In this study, we aimed to predict the Cd concentration of soils from basic soil properties using machine learning and regression models.

Material and methods

According to the Köppen classification, Konya province has a cold semi-arid climate (BSk), while the Trewartha classification categorizes the climate as temperate continental (Dc) (Peel et al. 2007; Yılmaz and Çiçek 2018). The mean summer temperature is approximately 30 °C, with chilly nights, while the average winter temperature is −4.2 °C. Annual average precipitation is low (325 mm) and occurs primarily during the winter and spring seasons.

This study focuses on the Great Konya Basin, which is an important agricultural basin in Türkiye. The region is known for its diverse fruit production, particularly sweet cherry, apple, peach, and other fruits. Additionally, the area is suitable for cash crops such as corn and clover where irrigation water is accessible, and for wheat and other drought-resistant cereals in the rain-fed areas (Fig. 1). The consistent use of commercial fertilizers, together with some organic fertilizers, has sustained the yields of the orchards. However, due to freshwater scarcity in the region, a very large portion of the agricultural land is under a low-input rain-fed agricultural system. Therefore, drought-resistant cereals such as wheat and barley are the most common field crops in the area.

Fig. 1 Sampling nodes on the land-use classification map according to Coordination of Information on the Environment (CORINE) 2018 (Anonymous 2024)

The altitude of the Great Konya Basin is approximately 1000 m. The influx of multiple rivers has influenced the basin, so that lacustrine carbonate formation and diagenesis play a significant role in determining the physicochemical and mineralogical characteristics of the soils (de Meester 1971; Ozaytekin et al. 2012). The soils were primarily formed on Quaternary sediments (de Meester 1970a, b). These sediments were deposited in a shallow lake during the Late Pleistocene, forming several sandy beach ridges and sand plains. The presence of diverse sediments led to the formation of different physiographic units, including uplands, colluvial slopes, piedmont plains, bajadas, terraces, alluvial plains, lacustrine plains, and soft-calcareous lake bottoms (de Meester 1970a, b). The flat Neogene limestone terraces are situated on the periphery of the Konya Basin. The terrain slopes gradually towards the central region and has been locally fragmented by erosion gullies. The southern region of the basin is characterized by alluvial plains and fans composed of sediments ranging from coarse sand to clay textures (Roberts et al. 1979; Roberts 1983).

Soil sampling

The soil sampling locations within the study area are shown in Fig. 1. Soil samples were collected from the 0–20 cm depth. The samples were homogenized using standard soil ploughing techniques. A total of 538 soil samples were collected at random (Fig. 2), of which 357 were from agricultural fields (mainly cereals), 85 from irrigated corn and fruit orchards, 47 from non-agricultural areas, 41 from pastures, and 8 from inside the city (Fig. 1).

Fig. 2 Sampling locations on soil classification map of the study area

The soil samples were air-dried and subsequently passed through a 2 mm sieve. The physical and chemical attributes of the soil were evaluated through conventional techniques applicable to calcareous soils (Sparks 1996; Kacar 2009). Particle size distribution (clay, silt, and sand) was determined by the hydrometer method. Electrical conductivity (EC) and pH were measured in the saturation paste using an EC probe and a combined pH electrode (Rhoades 1996), and EC was used to calculate total soluble salt. Cation exchange capacity (CEC) was determined by the sodium acetate (1 M, pH 8.2) method (Sommer and Miller 1996). The organic matter content was determined using the Walkley–Black dichromate oxidation method (Nelson and Sommer 1996), calcium carbonate content was determined with a pressure calcimeter (Loeppert and Suarez 1996), and available phosphorus was measured using the NaHCO3 method (Kuo 1996). The soil samples were digested in aqua regia with a volumetric ratio of 3:1 of HNO3 to HCl. Subsequently, the concentrations of total phosphorus (P) and cadmium (Cd) in the resulting digests were analyzed using inductively coupled plasma optical emission spectroscopy (ICP-OES) with a PerkinElmer Optima 2100 instrument, following the United States Environmental Protection Agency protocol (USEPA 1996).

Statistical procedures

The descriptive statistics of all soil properties from each observation point were presented as mean, median, standard deviation, variance, skewness, kurtosis, range, minimum, maximum, and coefficient of variation (Table 1). The relationships between the measured soil parameters were then determined using Spearman correlation.
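
As a minimal sketch of the rank correlation named above, the Python snippet below computes Spearman's coefficient as the Pearson correlation of average ranks. The function names and the toy values are illustrative assumptions; they are not data or code from this study, which was carried out in R.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values for a soil property and Cd at five sampling points.
clay = [12.0, 25.0, 31.0, 40.0, 18.0]
cd = [0.10, 0.22, 0.20, 0.35, 0.15]
rho = spearman(clay, cd)
```

Because only ranks enter the calculation, the coefficient is insensitive to skewed distributions, which is a common reason to prefer it over Pearson correlation for soil data.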

Table 1 Descriptive statistics for measured soil properties

The data were initially partitioned into a training set (70% of the data) and a testing set (30%). The training set was then randomly split into learning (70%) and validation (30%) subsets, which were used to train a variety of machine learning and regression models. With Cd as the response variable, the optimal parameters of each model were selected through a grid search in parameter space. The final model was chosen based on the root mean squared error (RMSE), with preference given to the model with the lowest RMSE. All statistical analyses were conducted in R version 4.3.0 (R Core Team 2023) using RStudio.
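
The partitioning and grid search described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' R workflow: the data are synthetic, the k-nearest-neighbour regressor is a stand-in for the models actually tuned, and the grid of `k` values is hypothetical.

```python
import random
from math import sqrt

def split(rows, first_fraction, rng):
    """Shuffle a copy of rows and split it into two lists."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(first_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def rmse(observed, predicted):
    """Root mean squared error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def knn_predict(train, x, k):
    """Stand-in model: mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

rng = random.Random(42)
# Synthetic (predictor, response) pairs; only the row count mirrors the study.
rows = []
for _ in range(538):
    x = rng.uniform(0, 50)
    rows.append((x, 0.02 * x + rng.gauss(0, 0.05)))

train, test = split(rows, 0.70, rng)      # 70% training, 30% testing
learn, valid = split(train, 0.70, rng)    # 70% learning, 30% validation

# Grid search: keep the setting with the lowest validation RMSE.
grid = [1, 3, 5, 9, 15]
scores = {}
for k in grid:
    predictions = [knn_predict(learn, x, k) for x, _ in valid]
    scores[k] = rmse([y for _, y in valid], predictions)
best_k = min(scores, key=scores.get)

# Score the selected setting once on the held-out test set.
test_rmse = rmse([y for _, y in test], [knn_predict(train, x, best_k) for x, _ in test])
```

Keeping the test set out of the grid search means the final RMSE is not biased by the parameter selection.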

Machine learning algorithm

The multivariate adaptive regression splines (MARS) algorithm was introduced by Friedman (1991) as a nonparametric machine learning technique for classification and regression tasks that is especially effective for data exhibiting nonmonotonic or nonlinear patterns. The MARS model generates a set of piecewise-linear basis functions whose weighted sum predicts the values of the continuous variable of interest.
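
To make the form of a MARS model concrete, the sketch below evaluates a prediction as an intercept plus a weighted sum of hinge basis functions. The knot, coefficients, and function names are hypothetical and chosen only for illustration; no fitted model from this study is reproduced.

```python
def hinge(x, knot, direction=+1):
    """MARS basis function: max(0, x - knot) if direction is +1, max(0, knot - x) if -1."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum(coef * hinge) over (coef, knot, direction) terms."""
    return intercept + sum(coef * hinge(x, knot, d) for coef, knot, d in terms)

# Hypothetical model with a single knot at 7.5 on one predictor:
# slope +0.04 to the right of the knot, slope +0.01 (via -0.01 * mirrored hinge) to the left.
terms = [(0.04, 7.5, +1), (-0.01, 7.5, -1)]
low = mars_predict(7.0, 0.20, terms)    # 0.20 - 0.01 * 0.5
high = mars_predict(8.0, 0.20, terms)   # 0.20 + 0.04 * 0.5
```

The change of slope at the knot is what lets the model follow nonlinear responses while each segment remains linear.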

Decision trees are tree-structured algorithms that can be used to predict a quantitative response (Ali et al. 2015). Breiman et al. (1984) developed the classification and regression trees (CART) procedure for this purpose. The CART algorithm is a binary tree model in which a node is recursively divided into two child nodes. The algorithm iteratively derives a set of homogeneous nodes from the learning dataset, aiming to reduce the error variance across both the training and test sets.
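
The recursive binary division rests on one repeated step: finding the threshold that makes the two child nodes as homogeneous as possible. The sketch below shows that step for a single predictor, using the sum of squared errors as the impurity measure. The helper names and the toy data are assumptions for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (node impurity for regression)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the split x <= threshold with the lowest SSE."""
    best = (None, sse(y))
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2    # candidate midway between neighbouring values
        left = [b for a, b in zip(x, y) if a <= threshold]
        right = [b for a, b in zip(x, y) if a > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Toy data with an obvious break between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0.1, 0.1, 0.1, 0.5, 0.5, 0.5]
threshold, impurity = best_split(x, y)
```

A full tree applies the same search to each child node in turn until a stopping rule, such as a minimum node size, is met.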

Random forest is a nonparametric algorithm, first introduced by Breiman (2001), that can handle both classification and regression tasks. It combines a large number of regression trees. Each tree is grown on a bootstrap sample of the data, and its splits are chosen from a randomly selected subset of the predictors. As a result, each individual regression tree in the random forest assigns a different group of predictors to its root, internal, and leaf nodes. Svetnik et al. (2003) proposed that the mean of the tree outputs at the corresponding leaf nodes be used as the predicted value of the dependent variable.
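
The two sources of randomness and the final averaging can be sketched as below. This is a deliberately simplified illustration: each "tree" is a single-split stump on one randomly chosen predictor, whereas a full random forest grows deep trees and re-samples candidate predictors at every split. All names and the synthetic data are assumptions.

```python
import random

def fit_stump(rows, feature):
    """Fit a one-split regression tree on a single feature; return a predictor function."""
    values = sorted({f[feature] for f, _ in rows})
    overall = sum(y for _, y in rows) / len(rows)
    best = (float("inf"), None, overall, overall)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for f, y in rows if f[feature] <= t]
        right = [y for f, y in rows if f[feature] > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    if t is None:                      # no split possible: predict the mean
        return lambda f: overall
    return lambda f: ml if f[feature] <= t else mr

def fit_forest(rows, n_trees, rng):
    """Average many stumps, each fitted on a bootstrap sample and a random predictor."""
    n_features = len(rows[0][0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]    # bootstrap: draw rows with replacement
        feature = rng.randrange(n_features)          # random predictor for this tree
        trees.append(fit_stump(sample, feature))
    return lambda f: sum(tree(f) for tree in trees) / len(trees)

rng = random.Random(0)
rows = []
for _ in range(200):
    f = (rng.uniform(0, 10), rng.uniform(0, 10))     # only the first predictor matters
    rows.append((f, (0.4 if f[0] > 5 else 0.1) + rng.gauss(0, 0.02)))

forest = fit_forest(rows, 50, rng)
low, high = forest((2.0, 5.0)), forest((8.0, 5.0))
```

Averaging over trees that see different samples and predictors reduces the variance of the prediction compared with any single tree.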

XGBoost is an efficient and scalable implementation of the gradient boosting approach introduced by Friedman (2001). It is a tree-based regression algorithm that uses the same decision rules as the decision tree algorithm. Yu et al. (2020) state that XGBoost combines a collection of classification and regression trees into an expression that fits the training dataset well. Moreover, XGBoost can exploit sparsity and counter overfitting by incorporating shrinkage and regularization (Gertz et al. 2020). In the training phase, XGBoost uses decision trees to partition the observations and identify the variables that improve the model's effectiveness. Furthermore, computational efficiency is prioritized over incorporating superfluous variables (Gertz et al. 2020). The primary objective of this procedure is to construct an ensemble of decision trees with low bias while keeping the variance under control.
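
The core idea of gradient boosting with shrinkage can be sketched in a few lines. The example below is plain gradient boosting for squared error with one-split trees; it is not the XGBoost library, which additionally uses second-order gradient information, explicit regularization terms, and sparsity-aware split finding. The data, the number of rounds, and the learning rate are hypothetical.

```python
def fit_stump(x, target):
    """Least-squares regression stump on one predictor; returns a predictor function."""
    overall = sum(target) / len(target)
    best = (float("inf"), None, overall, overall)
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for v, r in zip(x, target) if v <= t]
        right = [r for v, r in zip(x, target) if v > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    if t is None:
        return lambda v: overall
    return lambda v: ml if v <= t else mr

def boost(x, y, n_rounds, learning_rate):
    """Fit each new stump to the current residuals and add a shrunken copy of it."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]   # negative gradient of squared loss
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda v: base + learning_rate * sum(s(v) for s in stumps)

def mse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

x = [float(i) for i in range(20)]
y = [0.1 + 0.01 * i + (0.2 if i >= 10 else 0.0) for i in range(20)]
model = boost(x, y, n_rounds=50, learning_rate=0.3)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [model(v) for v in x])
```

A learning rate below 1 (shrinkage) makes each tree correct only part of the remaining error, which slows fitting but typically reduces overfitting.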

Model comparison criteria

The metrics frequently utilized for model comparison include root mean squared error (RMSE), standard deviation ratio (SDR), coefficient of determination (R2), and coefficient of variation (CV) (Table 2). A good fit is indicated by low values of RMSE, CV, and SDR and by high values of R2 and the correlation coefficient (r). The ehaGoF package, developed by Eyduran (2019), was used to compute the goodness-of-fit criteria.
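
For orientation, the sketch below computes these criteria from observed and predicted values using commonly cited definitions: RMSE as the root of the mean squared residual, SDR as the ratio of the residual standard deviation to the observed standard deviation, CV as the residual standard deviation relative to the observed mean (in percent), and R2 as one minus the ratio of residual to total sum of squares. These formulas are assumptions made for illustration and should be checked against Table 2 and the ehaGoF documentation; the numbers are toy values.

```python
from math import sqrt
from statistics import mean, stdev

def goodness_of_fit(observed, predicted):
    """Return RMSE, SDR, CV (%) and R2 under the definitions assumed above."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    sse = sum(e ** 2 for e in residuals)
    sst = sum((o - mean(observed)) ** 2 for o in observed)
    return {
        "RMSE": sqrt(sse / len(observed)),
        "SDR": stdev(residuals) / stdev(observed),
        "CV": 100 * stdev(residuals) / mean(observed),
        "R2": 1 - sse / sst,
    }

# Toy observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
fit = goodness_of_fit(observed, predicted)
```

RMSE is expressed in the units of the response, whereas SDR, CV, and R2 are scale-free, which is why they are usually reported together.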

Table 2 Assessing the Performance of the Models used to estimate Cd concentration of soils

Results and discussion

Correlation analysis was used to determine the relationship between the Cd concentration of soils and soil properties including EC, pH, total phosphorus (P), organic matter (OM), calcium carbonate equivalent (CaCO3), available phosphorus (P2O5), cation exchange capacity (CEC), sand, silt, and clay (Fig. 3). The correlation test helps to clarify the impact of the variables on Cd levels in soil and to select appropriate variables for the machine learning and regression models. The Cd concentration of soils was significantly correlated (P < 0.01) with EC (r = 0.16), pH (r = 0.17), P (r = 0.26), P2O5 (r = 0.22), CEC (r = 0.13), silt (r = 0.16), clay (r = 0.22), CaCO3 (r = −0.28), and sand (r = −0.26). Except for soil organic matter, all of the soil properties were significantly correlated with Cd, which different mechanisms may explain. The positive relation between electrical conductivity and Cd concentration may be attributed to the relatively high mobility of both salt and Cd in the soil environment (Kabata-Pendias 2011). In addition, the irrigated agricultural lands, which require larger fertilizer inputs than the typical rain-fed management system, have been under increasing salinity threat (Usta 1999). The CEC and clay content are related to Cd sorption, which results in accumulation.

Fig. 3 Bivariate Spearman correlation coefficients between the soil properties. Significant at the 0.01 (**) and 0.05 (*) levels for the 2-tailed correlation coefficient. P total phosphorus, OM organic matter, CaCO3 calcium carbonate equivalent, P2O5 available phosphorus, CEC cation exchange capacity, TCd total cadmium, EC electrical conductivity

The concentration of heavy metals in soils is influenced by various factors, including pH, soil organic matter, cation exchange capacity, and clay content (Alloway 2012). Chavez et al. (2015) examined the correlation between Cd and soil properties and found a strong association between Cd and clay, CEC, pH, P2O5, and P. According to Seshadri et al. (2016), the application of phosphorus fertilizers is a significant contributor to soil Cd contamination. The present findings indicate that EC, pH, P2O5, P, CEC, silt, and clay are positively correlated with Cd, which is consistent with the existing literature. The reaction between the lime present in the environment and Cd results in the formation of an insoluble compound (Sø et al. 2011). According to Cattani et al. (2008), the incorporation of CaCO3 resulted in a 25% decrease in the Cd content of rice. The negative relationship between sand content and Cd concentration can be attributed to reduced surface area and ease of leaching. In contrast, the negative correlation with lime content is thought to be linked to a relative decrease in the proportion of soil components with high sorption capacity, such as clay and sesquioxide minerals, and to the incorporation of raw parent material due to shallow soil depth in hilly topography (de Meester 1971; Ozaytekin et al. 2012).

Chemometric relations of cadmium

The model performance assessments for MARS, decision trees, linear regression, random forest, and XGBoost are given in Table 3. A model can be used successfully for prediction when its RMSE and SDR values are near zero (Grzesiak and Zaborski 2012), its coefficient of variation (CV) is below 10% (Wilding et al. 1994), and its coefficient of determination (R2) approaches 100%. By these criteria, the XGBoost algorithm emerged as a dependable machine learning technique for forecasting the Cd concentration (Table 3). The predicted and actual values of the XGBoost model were strongly and positively correlated (Fig. 4), and its extremely high coefficient of determination (0.981) exceeded that of the other machine learning models (Fig. 4, Table 3).

Table 3 Model performance assessment of relationships between soil properties and Cd

Fig. 4 The prediction performance of the models

Figure 5 presents the contribution of the explanatory variables used in the XGBoost model to the prediction of Cd. EC (12.77%) and pH (11.78%), parameters related to solubility and equilibrium in soils, are important in predicting Cd. Available phosphorus (P2O5), which increases with the application of phosphorus fertilizers that are used to sustain agricultural production and contain trace amounts of Cd, is also important in the estimation of Cd (11.30%). In the study area, a former lake bottom, silt (10.25%) and CaCO3 (10.25%) are also important. Parameters such as EC, pH, silt, and CaCO3, which can be counted among the genetic characteristics of soils, play a dominant role in Cd estimation. At the same time, available phosphorus and total P, which are related to fertilization and controlled by anthropogenic factors, together account for 21.19% of the variance and take first place in Cd estimation. These five parameters accounted for 56.51% of the variance in the estimation of Cd by the XGBoost model. Topography could also contribute significantly as a driving force, but it was not considered in this study.

Fig. 5 Contribution of each variable to the XGBoost model (P total phosphorus, OM organic matter, CaCO3 calcium carbonate equivalent, P2O5 available phosphorus, CEC cation exchange capacity)

Machine learning, which has started to be used in modelling the inputs of natural ecosystems, has not yet received the attention it deserves in soil science. Tırınk and Öztürk (2023) used MARS and XGBoost algorithms to predict PM10 particulate matter. Faraz et al. (2023) estimated the body weight of animals using MARS and XGBoost algorithms. Demirel et al. (2023) estimated wheat grain yield using MARS and CHAID algorithms. Eren et al. (2023) modelled DNA methylation under in vitro conditions using machine learning methods. A literature search conducted on the Web of Science platform on May 16, 2023, retrieved 19 scientific papers concerning XGBoost in the field of soil science. Most of these publications built models based on data acquired from spectrophotometric techniques, such as VNIRS, and/or remote sensing predictions (Andrade et al. 2020; Mahmoudzadeh et al. 2020; Abedi et al. 2021; Ågren et al. 2021; Nguyen et al. 2021; Xu et al. 2022). The publications obtained through online open-source searches address similar subject matter (Chen et al. 2022; Ge et al. 2021; Dhaliwal et al. 2022; Ye et al. 2021; Budak et al. 2023). Budak et al. (2023) demonstrated that boosting algorithms produced the most robust and reliable estimates. Similarly, Gokmen et al. (2023) used XGBoost to determine the available boron content of soils, achieving high reliability with an R2 of 0.997. In another study, Huang et al. (2023) investigated the performance of various models under different constraints. Their findings indicate that the post-constraint XGBoost model predicts the concentration of soil available Cd with a high level of accuracy (R2 = 0.81). Notably, this model keeps the predicted values within the theoretical boundaries. In the present study, the explanatory variables that contribute most to the XGBoost model of Cd concentration are mainly those that were significant in the correlation analysis. The XGBoost regression model is therefore expected to be highly reliable in estimating total cadmium in soils.

Conclusions

This study tested several machine learning and regression models that provide a new perspective on the relationship between Cd levels and soil properties in the soils of the Konya Plain. Using measured soil properties, the XGBoost algorithm predicted the Cd concentration of soils better than the other methods. The developed XGBoost model is expected to help identify Cd, assess economic and environmental impacts in a variety of local settings, and guide decision-making by environmental policymakers with a relatively limited number of soil tests. For farmers, this study suggests that phosphorus fertilization should be done with caution, taking into account solubility and equilibrium relationships based on easily measurable soil parameters such as pH and EC. For legislators and administrators, phosphorus fertilization strategies should be examined more carefully when preparing national fertilization programs. Especially in arid and semi-arid areas, fertilization in excess of crop needs should be avoided. These findings can guide further studies and hold promise for environmental and interdisciplinary research planning in this field. Because soil properties, topography, agronomic management, and crop cover vary widely, further research should investigate whether this approach can be applied to other ecosystems of different sizes and with different crops and varieties, and whether doing so improves prediction accuracy.