1 Introduction

The metalloid selenium (Se), atomic number 34, is an essential micronutrient for humans and some animals, but it is toxic at high concentrations (Zhao et al. 2020). Both the natural occurrence of Se and Se pollution pose significant health risks in several parts of the world. Its accumulation through geological and geothermal activity, exacerbated by human activity, has become a serious environmental and public health concern. Because it is both beneficial and toxic to health, Se is regarded as a "double-edged sword element" (Lichtfouse et al. 2022). Its chemical properties are comparable to those of sulphur. In the environment, Se occurs in four natural valence states: elemental Se (0), selenide (−2; Se²⁻), selenite (Se(IV); SeO₃²⁻), and selenate (Se(VI); SeO₄²⁻) (Albukhari et al. 2021). In aqueous solution, Se(IV) and Se(VI) exist as the oxyanions SeO₃²⁻ (+4 oxidation state) and SeO₄²⁻ (+6 oxidation state). These oxyanions are highly mobile and more toxic than the reduced Se species, so Se(IV) and Se(VI) have significant environmental repercussions (Lin et al. 2020). The corresponding inorganic acids, selenious acid (H₂SeO₃) and selenic acid (H₂SeO₄), are both diprotic. In aquatic environments, Se enters the food chain readily through plant roots or aquatic organisms, where it can rapidly reach toxic levels and, through bioaccumulation, harm not only aquatic animals but also other life forms, including humans (Scheinost et al. 2021; Song et al. 2022). The solubility and retention of Se in water are primarily regulated by biogeochemical mechanisms (Ullah et al. 2018, 2022).

Biosorption, which uses non-conventional adsorbent materials commonly known as "biosorbents" because of their biological origin, is currently under active investigation. It is a water treatment method that employs natural, low-cost materials derived from renewable resources, often marine or agricultural. These materials, frequently polysaccharides, carry a large number of reactive groups on their surface that take part in the process. Biosorbents include chitosan, biomass (microorganisms), and agricultural wastes such as rice husks and peanut shells. They can remove contaminants, such as Se species, that are present in trace amounts in complex mixtures, so that treated wastewater meets the regulatory limit (Kidgell et al. 2014; Lichtfouse et al. 2021).

Forestry and agriculture are both moving toward a circular economy by generating energy from biomass wastes via pyrolysis, which yields three primary products: char, gas, and bio-oil. Biochar is a stable carbonaceous by-product of the thermochemical conversion of biomass under an oxygen-deficient supply (Zhen et al. 2023). Owing to the high pyrolysis temperature, biochar has a high pH, a large surface area, abundant nutrients, a high cation exchange capacity, and a stable structure. Its characteristics may be altered by partial gasification with steam, which changes its porosity, structure, and functional groups and yields activated carbon (Satyro et al. 2021). Biochar can also serve as a supporting framework for active chemicals such as iron oxides, which can improve the purification of water and soil. Contaminants such as selenium, chromium, and arsenic have been successfully removed using Fe-carbon composites (Hong et al. 2020). Numerous researchers have developed low-cost, highly effective adsorbents for the removal of particular pollutants (Ighalo et al. 2022). Despite its many advantages as an adsorbent, biochar's low adsorption rate limits its use in the removal of anionic contaminants such as Se. Biochar engineering, which involves changing the synthesis method or impregnating the biochar with chemicals to boost its adsorption capability, has lately received much attention (Meilani et al. 2021; Shi et al. 2021). Because biochar modified with Fe oxides can remove Se, impregnating biochar with Fe offers a quick and efficient route to a Se adsorbent. Impregnating biochar with metal oxides, metal hydroxides, and metal elements generates new composites that may significantly improve its performance (Lee et al. 2021).

It is critical to develop materials with excellent engineering and environmental characteristics for large-scale separation applications, particularly in the environmental field. Newly created materials should exhibit a high adsorption capacity arising from their structure, allow easy regeneration of the used adsorbent, and improve process efficiency. These properties should be considered when designing a novel adsorption material. Computational methods such as machine learning (ML) can aid in developing and using predictive models that anticipate the separation behavior of pollutants and support the effective removal of various components. This approach not only reduces process costs but also saves energy and time (Zhu et al. 2022).

Artificial intelligence is now used in practically every industry, including chemical processing (Chiu et al. 2021), stock market prediction (Zhong & Enke 2019), media recommendation systems (Ramzan et al. 2019), clinical diagnosis (Kononenko 2001), cybersecurity (Chellam et al. 2018), and environmental sensing and research (Hafsa et al. 2020). Machine learning (ML) is the process of using an algorithm to learn from data and make predictions without being explicitly programmed. ML encompasses three primary types: supervised, unsupervised, and reinforcement learning. In simple terms, supervised learning uses labelled data (known output variables) to train a model. Unsupervised learning, on the other hand, works with unlabelled data (no fixed output variable) to explore patterns and features in the data. Reinforcement learning involves a trial-and-error approach aimed at maximizing the rewards obtained from actions. Each type of ML employs its own set of algorithms (Kooh et al. 2021).

An ML approach can address this problem by modelling and learning the adsorption behavior of heavy metals on biochars (Z-Flores et al. 2017). The first step is therefore to develop and train a high-quality model capable of accurately forecasting adsorption efficiency. Although empirical models such as the Langmuir and Freundlich models have been used for decades to describe adsorption equilibrium, they cannot draw predictive conclusions, and they leave the association between adsorption outcomes and operating conditions unknown (Febrianto et al. 2009). The recent development of surface complexation modeling combined with spectroscopic methods holds promise for understanding the interaction between metals and adsorbents (Vithanage et al. 2013). However, this approach requires specialized equipment that is not widely available. Fortunately, ML models, particularly random forest (RF), have proved effective in modeling and predicting complex, non-linear relationships between independent and dependent variables (Zhu et al. 2019). RF is an ensemble ML method based on decision trees that uses voting for classification and averaging for regression, making it a valuable tool for prediction.

Using an accurate model for novel experimental adsorption data is beneficial because it reduces the effort and material required for additional tests. Hence, the ML technique was chosen for two main purposes: (i) developing a predictive model that estimates Se adsorption efficiency on Fe-modified biochars from factors such as biochar properties, pH, temperature, and Fe load (%); and (ii) identifying the relative significance of each influencing factor involved in the adsorption process.

2 Methodology

2.1 Data collection

Biochar and Fe-modified biochar were chosen as the target adsorbents in this work, and high-quality, representative raw data were used to build a prediction model. Data on selenium adsorption by biochar and modified biochar (Fe-biochar) were gathered impartially from relevant papers indexed in Google Scholar and published during the last decade. The compiled data set comprised 40 samples and covered two main aspects: (a) the removal capacity of both pristine biochar and Fe-biochar for aqueous Se, and (b) the characteristics of Fe-biochar and the corresponding pristine biochar following iron impregnation. The data set had no initial bias regarding data authenticity. Values were taken directly from tables, or extracted from figures with Plot Digitizer 2.6.8, in the relevant published research articles and their supporting data. Records with missing values were excluded rather than imputed, because general imputation approaches may introduce errors in a data set as small as ours. The feedstock properties, operational parameters, reaction solvent, and separation solvent were collected from the relevant publications. Twenty distinct types of feedstock were recorded in detail, including biomass, food waste, manure, and microalgae. To maximize the trustworthiness of the data set and eliminate inconsistent values, the units of every variable were standardized.
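The two cleaning steps described above, excluding incomplete records and standardizing units, can be sketched as follows. This is a minimal illustration only: the field names, values, and conversion table are hypothetical and are not taken from the compiled data set.

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"feedstock": "rice husk", "pH": 5.0, "temp_C": 25.0, "fe_load_pct": 10.0,
     "capacity": 1.5, "capacity_unit": "mg/g"},
    {"feedstock": "manure", "pH": 7.0, "temp_C": None, "fe_load_pct": 5.0,
     "capacity": 800.0, "capacity_unit": "mg/kg"},
    {"feedstock": "microalgae", "pH": 6.0, "temp_C": 35.0, "fe_load_pct": 0.0,
     "capacity": 450.0, "capacity_unit": "mg/kg"},
]

# Conversion factors to the common unit (mg/kg); 1 ug/g equals 1 mg/kg.
TO_MG_PER_KG = {"mg/kg": 1.0, "mg/g": 1000.0, "ug/g": 1.0}


def is_complete(record):
    """A record is kept only if none of its fields is missing."""
    return all(value is not None for value in record.values())


def to_common_units(record):
    """Return a copy of the record with the capacity expressed in mg/kg."""
    cleaned = dict(record)
    cleaned["capacity"] = record["capacity"] * TO_MG_PER_KG[record["capacity_unit"]]
    cleaned["capacity_unit"] = "mg/kg"
    return cleaned


# Drop incomplete records (no imputation), then standardize the units.
dataset = [to_common_units(r) for r in records if is_complete(r)]
```

Dropping rather than imputing keeps every retained value traceable to a published measurement, at the cost of a smaller data set.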

After a thorough review and collection of relevant information from the literature, the data points for Se adsorption within the described equilibrium ranges were estimated. A high-quality, representative data source is crucial for accurate prediction models. However, despite the large number of adsorption studies, only a small fraction meet the criteria for developing reliable prediction models, because of either poor data quality or insufficient reporting of critical characteristics. For robust prediction models, it is essential to measure and report adsorbent parameters such as surface area and total pore volume (additional properties such as macropore and micropore volume can also be included, if available) (Zhang et al. 2020). Factors influencing the Se removal potential (mg/kg) of biochar and Fe-biochar were investigated and categorized to understand their individual roles in Se removal. The reaction conditions examined were solution pH (pHsol), temperature (T, °C), and Fe-biochar dose. In addition, data on the physicochemical characteristics of Fe-biochar and the corresponding pristine biochar, including elemental analysis results (C%, O%, H%), were obtained to see how biochar characteristics changed after iron impregnation. To check the reliability of the data, a permutation test (Y-scrambling) was performed, which indicated that the modelled relationships did not arise by chance. However, there was a slight difference in variance between the data, as depicted in Fig. S1. Surface chemical parameters from XPS analysis, as well as data on the percentage of functional groups, were not examined in the current investigation because they were not adequately reported in the relevant literature.
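The idea behind a Y-scrambling permutation test can be sketched as follows. The sketch assumes a simple one-variable least-squares fit as a stand-in for the actual models, and the data are synthetic; the function names are illustrative.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def model_r2(x, y):
    """R2 of the fitted line on the data it was fitted to."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def y_scrambling(x, y, n_permutations=200, seed=0):
    """Compare the observed R2 with R2 values obtained after shuffling y."""
    rng = random.Random(seed)
    observed = model_r2(x, y)
    scrambled = []
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)  # breaks any real x-y relationship
        scrambled.append(model_r2(x, shuffled))
    # Share of scrambled fits that match or beat the real one (+1 correction).
    exceed = sum(s >= observed for s in scrambled)
    p_value = (1 + exceed) / (n_permutations + 1)
    return observed, scrambled, p_value


# Synthetic data with a strong linear trend and small alternating noise.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
observed, scrambled, p_value = y_scrambling(x, y)
```

A small `p_value` means that shuffled responses rarely reproduce the observed fit, which is the sense in which a chance correlation is ruled out.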

2.2 Preliminary data analysis with random forest (RF), support vector regression (SVR) and SHAP models

Machine learning (ML) models focusing on diverse material characteristics were designed to highlight the most fundamental pathways of Se adsorption. Model BP (basic properties) was developed from the basic properties of biochar and Fe-biochar (i.e., C, O, and Fe load %) and the reaction conditions (i.e., pHsol and temperature). Fe load is the amount of iron present in the Fe-biochar, which can significantly affect its effectiveness as an adsorbent. pHsol is the pH of the solution containing the biochar or Fe-biochar, an important factor in determining the adsorption capacity of these materials. Temperature also affects adsorption, as higher temperatures can increase the rate of chemical reactions. Model BP can be used to predict the behavior of biochar and Fe-biochar under different conditions and to optimize their use as adsorbents for various contaminants, supporting more efficient and sustainable wastewater treatment and environmental remediation strategies.

Every data set contains valid values for all the variables used in each ML model, with the exception of the records removed because of missing data. Se removal data by biochar or Fe-biochar were chosen as the test group from common sets in the models to assess their prediction results independently, while the remaining data in each model were randomly assigned to two groups: a training set and a test set. The training group was used to train ML models with the RF algorithm, an ensemble ML technique that improves predictive accuracy and avoids over-fitting by averaging the outputs of decision trees fitted on numerous sub-samples of the data set (i.e., bagging) (Torrisi et al. 2020). According to the response of the coefficient of determination (R2) or the root-mean-square error (RMSE) in the validation group, the parameters can be tuned by trial and error and grid search (Zhu et al. 2022). The RF analysis was carried out in R with the RF library, and the model was tuned with the caret library. The dplyr, reptree, and devtools libraries were also used. The R code was sourced from the RPubs and GitHub websites, where all the RF code is available (https://rpubs.com and https://github.com).
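The bagging principle behind RF regression (fit many trees on bootstrap sub-samples, then average their predictions) can be illustrated with a deliberately simplified sketch. It assumes one-split trees (stumps) on a single feature and synthetic data; a real random forest grows deeper trees and also samples features at each split, and this is not the R code used in the study.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature by minimizing the SSE."""
    overall = sum(ys) / len(ys)
    best_sse, best = float("inf"), (None, overall, overall)
    # Candidate thresholds: every unique value except the smallest, so that
    # both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if sse < best_sse:
            best_sse, best = sse, (threshold, left_mean, right_mean)
    return best


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None:  # no valid split was found; fall back to the mean
        return left_mean
    return left_mean if x < threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Fit each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Regression output: the average of the individual tree predictions."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


# Synthetic step-shaped response.
xs = [float(i) for i in range(20)]
ys = [1.0 if x < 10 else 10.0 for x in xs]
forest = fit_bagged_stumps(xs, ys)
low, high = predict_forest(forest, 2.0), predict_forest(forest, 17.0)
```

Averaging over bootstrap samples is what reduces the variance of the individual trees, which is the mechanism credited above with limiting over-fitting.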

Support vector regression (SVR) is an ML algorithm for regression tasks. It is particularly useful where traditional linear regression cannot capture non-linear relationships between the variables. The support vector machine (SVM) framework can be applied to a continuous target variable by using an alternative loss function and a kernel function (Shi et al. 2021). For a training data set with n samples, the SVR formulation is as follows:

$$T=\left\{\left({x}_{1},{y}_{1}\right),\left({x}_{2},{y}_{2}\right),\dots ,\left({x}_{n},{y}_{n}\right)\right\},\quad {x}_{i}\in {\mathbb{R}}^{d},\ {y}_{i}\in \mathbb{R}$$
(1)

The following linear function models the non-linear relation between input and output:

$$f\left(x\right)={w}^{T}\varphi \left(x\right)+b$$
(2)

where \(f\left(x\right)\) refers to the output and \(\varphi \left(x\right)\) is the mapping function (Fatahi et al. 2022); \(w\) is the weight vector and \(b\) is the bias term. SVR was also performed in R, using the MASS, caret, and kernlab libraries. The code for SVR is available at the same websites as the code for RF.
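The two ingredients that adapt the SVM to regression, a loss function and a kernel function, can be sketched as follows. The sketch assumes the standard ε-insensitive loss and a radial basis function (RBF) kernel; the text does not state which kernel was used, so the RBF choice and the parameter values are assumptions for illustration.

```python
import math


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger errors grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def rbf_kernel(x1, x2, gamma=0.5):
    """RBF kernel: similarity of two samples in the implicit feature space."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


inside_tube = epsilon_insensitive_loss(1.0, 1.05)   # error within the tube
outside_tube = epsilon_insensitive_loss(1.0, 1.5)   # error beyond the tube
same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0])
distant_points = rbf_kernel([0.0, 0.0], [1.0, 1.0])
```

The kernel evaluates inner products of mapped samples without constructing the mapping function of Eq. (2) explicitly, which is what lets a model that is linear in feature space capture non-linear relationships in the original inputs.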

SHAP (SHapley Additive exPlanations) is an ML technique used to interpret complex models and explain the contribution of each feature to the predicted outcome. The SHAP values measure the impact that each feature has on the model's output (Fatahi et al. 2022). In the context of Se adsorption by modified biochar, an ML model is first trained to predict the adsorption capacity of different types of biochar from their physical and chemical properties. SHAP is then used to identify which features of the biochar are most important in determining its adsorption capacity and how each feature contributes to the overall adsorption process. This information can help in the development of more effective biochar-based remediation strategies. The Shapley value for the model can be calculated by the following equation:

$${\phi }_{i}\left(f,x\right)=\sum_{S\subseteq M\backslash \left\{i\right\}} \frac{\left|S\right|!\left(\left|M\right|-\left|S\right|-1\right)!}{\left|M\right|!}\left[f\left(S\cup \left\{i\right\}\right)-f\left(S\right)\right]$$
(3)
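Eq. (3) can be evaluated exactly when the number of features is small. The sketch below enumerates every subset S for a toy value function; the feature names and the function itself are hypothetical and stand in for a trained model evaluated on feature subsets.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values by enumerating all subsets S of M without i."""
    m = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (|M| - |S| - 1)! / |M|! from Eq. (3).
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(subset):
    """Hypothetical model output when only the features in `subset` are used."""
    score = 0.0
    if "O" in subset:
        score += 3.0
    if "C" in subset:
        score += 1.0
    if "O" in subset and "C" in subset:
        score += 2.0  # interaction term shared between O and C
    return score


phi = shapley_values(["O", "C", "pH"], toy_value)
total_attribution = sum(phi.values())
```

In this toy case the interaction term is split equally between the two interacting features, the unused feature receives zero, and the attributions sum to the difference between the full-model output and the empty-set output. Exact enumeration scales exponentially with the number of features, which is why SHAP libraries rely on model-specific or approximate algorithms.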

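In Eq. (3), M is the set of all input features, S is a subset of features that excludes feature i, and f(S) is the model output obtained with only the features in S. SHAP values were employed to enhance the transparency and comprehensibility of the XGBoost model, specifically to investigate the ranking of the most influential variables in Se adsorption.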

The SHAP XGBoost analysis was carried out in R with the caTools, caret, xgboost, and SHAP libraries. The PDP analysis with SHAP used the e1071 library.

2.3 Analysis of respective importance of variables and partial dependence plots

The information contained in the RF, SVR and SHAP XGBoost models was mined using feature importance and partial dependence plot (PDP) approaches. In the feature importance analysis, the weighted impurity reduction over all nodes was measured and averaged across all decision trees, which helps explain the role of each input factor in the variance of the target variable (De Clercq et al. 2020). The information obtained from the feature importance ranking alone, however, is limited. The PDP analysis reveals in more detail how the target variable depends on the critical input factors (Zhao et al. 2021). It should be emphasized that PDP analysis is more accurate in data-dense regions. Judging by the concentration of spikes on the x-axis, over-interpretation of partial dependence plots in regions with little or no data must be avoided. The PDP method was also used to determine the joint effects of any two input factors on the predicted outcome. The PDP analysis in all models was carried out with the e1071 library in R.
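A one-factor partial dependence curve can be computed by fixing the feature of interest at each grid value for every sample and averaging the model predictions. The sketch below assumes a hypothetical two-feature model (O % and pH) with made-up coefficients; any fitted model with a prediction function could take its place.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction with one feature forced to each grid value in turn."""
    curve = []
    for grid_value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = grid_value  # overwrite the feature
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    """Hypothetical model: removal rises with O (%) and falls with pH."""
    oxygen, ph = row
    return 2.0 * oxygen - 0.5 * ph


# Columns: O (%), pH. Values are illustrative only.
rows = [[10.0, 4.0], [20.0, 6.0], [30.0, 8.0]]
curve = partial_dependence(toy_predict, rows, feature_index=0,
                           grid=[10.0, 20.0, 30.0])
```

Because each point of the curve is an average over the observed values of the other features, grid values that lie far from any observed sample are extrapolations, which is why the curve is less trustworthy in sparse regions.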

2.4 Model performance

To evaluate the effectiveness of the RF, SVR, and SHAP XGBoost models in predicting Se adsorption, the predicted values were compared with the actual values. The data were divided into two subsets: 80% was used as the training data set and 20% as the validation data set. The comparison was made by calculating R2 and the root mean square error (RMSE) from a simple linear regression of predicted against experimental values. The R2 values represent the degree of fit between the experimental and predicted values of Se adsorption for all three models. These metrics indicate how effectively each model predicts Se adsorption and inform the choice of model for future analyses. R2 is given by Eq. (4):

$${\mathrm{R}}^{2}=1-\frac{{\sum }_{i=1}^{N}({Y}_{i}^{exp}-{Y}_{i}^{pred}{)}^{2}}{{\sum }_{i=1}^{N}({Y}_{i}^{exp}-{Y}_{avg}^{exp}{)}^{2}}$$
(4)

In this equation, the experimental and predicted values are represented by \({Y}_{i}^{exp}\) and \({Y}_{i}^{pred}\) while the average of actual values is represented by \({Y}_{avg}^{exp}\).

$$\mathrm{RMSE}= \sqrt{\frac{1}{N}\sum_{i=1}^{N}({Y}_{i}^{exp}-{Y}_{i}^{pred}{)}^{2}}$$
(5)

The root mean square error of the data was calculated by using Eq. (5). The calculation of R2 and RMSE is based on Eltohamy et al. (2023).
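The 80/20 split and the two metrics can be sketched as follows. R2 follows Eq. (4), and RMSE is computed with the conventional definition (the square root of the mean squared error). The sample indices and the experimental and predicted values are illustrative only.

```python
import math
import random


def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and cut them into training and validation subsets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def r_squared(y_exp, y_pred):
    """Coefficient of determination as written in Eq. (4)."""
    mean_exp = sum(y_exp) / len(y_exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot


def rmse(y_exp, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    n = len(y_exp)
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(y_exp, y_pred)) / n)


# 40 sample indices split 80/20, as described in the text.
train, validation = train_test_split(list(range(40)))

# Illustrative experimental and predicted values.
y_exp = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
r2_value = r_squared(y_exp, y_pred)
rmse_value = rmse(y_exp, y_pred)
```

With only 40 samples, the validation subset holds 8 points, so both metrics are sensitive to which samples happen to fall in it; repeating the split with several seeds gives a sense of that variability.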

The training predictions from all three models were compared with the real data, and the correlation coefficient was determined using a Taylor diagram in OriginLab (2023, USA).
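A Taylor diagram summarizes three statistics for each model: the standard deviation of the predictions, their correlation with the reference data, and the centered root-mean-square difference. A minimal sketch of these quantities, using illustrative values rather than the study's data:

```python
import math


def taylor_statistics(reference, predicted):
    """Statistics displayed on a Taylor diagram for one model."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_pred = sum(predicted) / n
    std_ref = math.sqrt(sum((r - mean_ref) ** 2 for r in reference) / n)
    std_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in predicted) / n)
    covariance = sum((r - mean_ref) * (p - mean_pred)
                     for r, p in zip(reference, predicted)) / n
    correlation = covariance / (std_ref * std_pred)
    # Centered RMS difference: the means are removed before comparing.
    centered_rmsd = math.sqrt(
        sum(((p - mean_pred) - (r - mean_ref)) ** 2
            for r, p in zip(reference, predicted)) / n)
    return {"std_ref": std_ref, "std_pred": std_pred,
            "correlation": correlation, "centered_rmsd": centered_rmsd}


# Illustrative reference and predicted values.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.3, 3.8, 5.1]
stats = taylor_statistics(reference, predicted)
```

The three statistics obey a law-of-cosines relation (the squared centered difference equals the sum of the two variances minus twice the product of the standard deviations and the correlation), which is what allows them to be drawn on a single two-dimensional diagram.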

3 Results and discussion

3.1 Incorporating surface characteristics information into machine learning algorithms improves prediction accuracy

Figure 1 plots the predicted Se removal capacity against the experimentally measured values. The lines show the regression trend, and the shaded areas represent 95% confidence intervals. The basic properties model, built from the basic surface features of biochar and Fe-modified biochar (i.e., carbon % and oxygen %) and the reaction conditions (Fe load, temperature and pH), predicted the Se removal capacity in the test group significantly (p < 0.05), with an R2 of 0.98 and an RMSE of 0.35 mg/kg (Fig. 1). The R2 value of 0.98 indicates that the model explains 98% of the variability in the data, so it is highly accurate in predicting the removal capacity of Se by biochar and Fe-modified biochar. The RMSE of 0.35 mg/kg indicates that the average deviation between the predicted and actual values is relatively small, which further supports the accuracy of the model. The RF model can therefore be considered a robust and reliable tool for predicting the removal capacity of Se by biochar and Fe-modified biochar, which could be valuable in designing efficient and cost-effective treatment strategies for Se-contaminated water or soil. Moreover, XPS data capture highly useful surface chemical characteristics that may serve as model inputs and could enhance the prediction power of Model BP; the appropriate use of surface chemistry data from XPS as a model input has been linked to acceptable prediction power (Zhu et al. 2022). Introducing the amounts of surface functional groups as inputs could further improve prediction accuracy, which remains a research gap to be addressed.

Fig. 1

Predicted versus experimentally determined Se removal capacity for the RF, SVR and SHAP models. The lines represent the regression lines and the shaded areas indicate 95% confidence intervals. R2 and RMSE were calculated from the predictions in the test group

The SVR model was also used to predict Se adsorption and proved the most effective, with an R2 of 0.98 and an RMSE of 0.14 mg/kg, the lowest RMSE in Fig. 1. This lower prediction error indicates that SVR is potentially more accurate than the RF and SHAP XGBoost models. The SHAP model was additionally used to explain the contribution of each feature to the predicted Se adsorption.

3.2 Effect of investigated variables on Se adsorption

Figure 2 shows the relative importance of the five input parameters for Se adsorption by Fe-modified biochar in the RF, SVR and SHAP XGBoost models. The features were sorted according to their relative ranks, and the Pearson correlation coefficient (PCC) for each parameter was noted and compared. The PCC reflects the linear dependency of Se adsorption on each parameter, whereas the relative importance computed with RF, SVR and SHAP XGBoost includes both linear and non-linear dependences (Panapitiya et al. 2018). Before the features were fed to each model, each feature was scaled to zero mean and unit variance. This ensured that no single feature dominated the algorithm's objective function and masked the effect of other, lower-variance features. In all three models, oxygen (%) was the most influential factor controlling Se adsorption. According to the relative importance method, the factors rank in descending order of importance as follows: oxygen (%), carbon (%), temperature, pH and Fe. This order agrees with an earlier report in which the effect of reaction temperature on Se adsorption was analyzed and temperature was found to have a stronger effect than Fe dosage (Lee et al. 2021). Adsorption performance is influenced by parameters such as solution pH and temperature (Meng et al. 2014; Wahid et al. 2013). The surface charge distribution and ion exchange potential of biochar appear to be strongly influenced by solution pH, which governs the adsorption or precipitation process on the surface (Ho et al. 2017). Because adsorption on biochars is a thermodynamically favourable process, the influence of temperature is understandable.
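The scaling step and the PCC can be sketched together, since the PCC is the mean product of two standardized variables. The oxygen and capacity values below are illustrative and are not the study's data.

```python
import math


def zscore(values):
    """Scale a feature to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def pearson(x, y):
    """Pearson correlation coefficient as the mean product of z-scores."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Illustrative values: capacity rises linearly with oxygen content.
oxygen = [10.0, 15.0, 20.0, 25.0, 30.0]
capacity = [1.0, 2.0, 3.0, 4.0, 5.0]

scaled_oxygen = zscore(oxygen)
r_positive = pearson(oxygen, capacity)
r_negative = pearson(oxygen, capacity[::-1])
```

Because the PCC captures only linear association, a feature with a weak PCC can still rank highly in the model-based importance if its effect is non-linear.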

Fig. 2

Importance rank of material properties for Se adsorption capacity by Fe modified biochar in RF, SVR and SHAP XGBoost. The explanatory variables are oxygen (O), carbon (C), temperature (Temp), pH and Iron (Fe) for all the three models

Pearson correlation coefficients (PCCs) between any two variables are shown in Fig. S2, and the ggpairs() function of the GGally package was used to create a scatterplot matrix (details are given in supplementary information Fig. S3). The PCC can be applied to determine the relationship between any pair of input variables. The correlation analysis showed a strong association of selenium with oxygen, followed by carbon. Oxygen had a direct relationship with selenium, corresponding to a rise in selenium adsorption, as shown by the Pearson correlation coefficient (r = 0.7). Carbon, in contrast, appeared to have an inverse relationship with selenium, with the coefficient value shown in Fig. S2. Other parameters, such as Fe and pH, had relatively weak correlations with selenium. The correlation of selenium with temperature was negative but less significant, indicating that selenium removal tends to decrease as temperature increases.

Reduction in adsorption capacity with increasing temperature may be due to increased thermal energy, which can weaken the interactions between selenium and the adsorbent surface. Additionally, the increased equilibrium adsorption time at higher temperatures may be due to the decreased adsorption capacity, which requires longer contact time to reach equilibrium (Zhang et al. 2020). In a thermodynamic adsorption study, researchers investigate how the adsorption of selenium onto Fe-modified biochar is affected by changes in temperature, pressure, and other factors that influence the chemical equilibrium of the system. Several studies have conducted thermodynamic adsorption experiments to investigate the adsorption of selenium onto Fe-modified biochar. For example, Hong et al. (2020) investigated the thermodynamics of selenium adsorption onto Fe-modified biochar using the Langmuir and Freundlich isotherm models. They found that the adsorption process was spontaneous and exothermic, meaning that it released energy as the selenium was adsorbed onto the biochar. The authors also reported that the adsorption capacity increased with increasing initial selenium concentration and decreasing temperature.

Similarly, Lee et al. (2021) conducted a thermodynamic study of selenium adsorption onto Fe-modified biochar and found that the adsorption process was thermodynamically feasible and spontaneous. They also reported that the adsorption capacity increased with increasing initial selenium concentration and decreasing temperature, and that the adsorption was well described by the Langmuir isotherm model. As the reaction temperature increased over the range 15–35 °C, selenate adsorption on Fe/CM-biochar showed an endothermic and nonspontaneous reaction.

Overall, these thermodynamic adsorption studies suggest that the adsorption of selenium onto Fe-modified biochar is influenced by several thermodynamic factors, including temperature, initial selenium concentration, and the specific adsorption isotherm model used. By understanding these factors, researchers can optimize the adsorption process for efficient selenium removal from contaminated water sources.
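For reference, the isotherm models and thermodynamic relations discussed above are commonly written as follows. The symbols are the conventional ones and are assumptions here, since they are not defined in the text: \({q}_{e}\) is the equilibrium adsorption capacity, \({C}_{e}\) the equilibrium concentration, \({q}_{max}\) the maximum monolayer capacity, \({K}_{L}\) the Langmuir constant, \({K}_{F}\) and \(n\) the Freundlich constants, \({K}_{c}\) the equilibrium constant, \(R\) the gas constant, and \(T\) the absolute temperature.

$${q}_{e}=\frac{{q}_{max}{K}_{L}{C}_{e}}{1+{K}_{L}{C}_{e}},\qquad {q}_{e}={K}_{F}{C}_{e}^{1/n}$$

$$\Delta {G}^{\circ }=-RT\mathrm{ln}{K}_{c},\qquad \Delta {G}^{\circ }=\Delta {H}^{\circ }-T\Delta {S}^{\circ }$$

$$\mathrm{ln}{K}_{c}=\frac{\Delta {S}^{\circ }}{R}-\frac{\Delta {H}^{\circ }}{RT}$$

A negative \(\Delta {G}^{\circ }\) corresponds to a spontaneous process and a negative \(\Delta {H}^{\circ }\) to an exothermic one. For an exothermic process, the last relation shows that \({K}_{c}\) falls as \(T\) rises, which is consistent with an adsorption capacity that decreases with increasing temperature.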

Higher oxygen contents were associated with more alkaline conditions, and selenium and oxygen varied together, with a rise in one corresponding to an increase in the other. This implies that Se adsorption is high under increased O (%) content and rising alkalinity. In the literature, the surface of Fe-modified biochar has been found to become negatively charged as the pH rises, resulting in increased electrostatic repulsion between anionic Se forms and the adsorbent. The reduction in adsorption capacity at higher pH values was caused by electrostatic repulsion between Se oxyanions and the negatively charged surface of the adsorbent (Zoroufchi Benis et al. 2022). Franzblau et al. (2014) found that the Se adsorption mechanism on iron oxide is highly influenced by the solution pH. Under alkaline and acidic pH levels, Se(VI) forms outer-sphere complexes and inner-sphere complexes on Fe oxide, respectively (Peak & Sparks 2002; Zhang et al. 2018). The stronger bonding of the inner-sphere complex, compared with that of the outer-sphere type, could be another reason for reduced selenite adsorption under alkaline conditions (Lee et al. 2021). A further factor contributing to reduced adsorption capacity could be competition for the adsorbent surface between negatively charged selenite and hydroxyl ions at alkaline pH (Zoroufchi Benis et al. 2022). Temperature, on the other hand, shows a significant positive correlation with carbon concentration and carbon looked to be negatively associated with carbon concentration.

3.3 Partial dependence plot analysis results

Figures 3, 4 and 5 show how the partial dependence of Se removal varies with the studied parameters for all three models. According to the RF, SVR and SHAP PDPs, the partial dependence of Se adsorption decreases as C% increases. In other words, as the carbon content in the system increases, the predicted Se adsorption decreases. This information clarifies the relationship between Se adsorption and C% in the models and can inform decisions about how to optimize adsorption in a real-world scenario.

Fig. 3

Partial dependence plots of Se removal on C (%), temperature, Fe (%), pH and O (%). The partial dependence plot was obtained on the basis of RF modeling

Fig. 4

Partial dependence plots of Se removal on C (%), temperature, Fe (%), pH and O (%). The partial dependence plot was obtained on the basis of SVR modeling

Fig. 5

SHAP XGBoost partial dependence plots of Se removal on C (%), temperature, Fe (%), pH and O (%). The partial dependence plot was obtained on the basis of SHAP XGBoost modeling

According to the one-factor PDPs, Se removal capability declined with increasing iron content of Fe-biochar in RF and SVR, but showed the opposite trend in SHAP. The decline may be attributable to the increasing iron dose on the biochar combined with greater quantities of iron leached into solution, since the quantity of Fe leached rises when more Fe is added. An increase in the quantity of solvent used has a negative impact on Se removal and typically enhances the Fe leaching process (Satyro et al. 2021). Similarly, when the quantity of adsorbent increases, the adsorption capacity for Se decreases, because the adsorbates (Fe) do not completely occupy the adsorption sites on the biochar surface (Ali et al. 2020; Lee et al. 2021). Nonetheless, in Model BP, the relative importance of the Fe level in the current study was ranked fifth (Fig. 2). The opposite trends arise because RF and SVR measure the contribution of a feature differently from SHAP: in RF and SVR, the partial dependence of Se removal decreases as Fe (%) increases, whereas in SHAP a higher Fe (%) has a greater impact on the model's output, which raises the SHAP value, a measure of the feature's impact on the model's output.

The Taylor diagram (Fig. 6) shows that the SVM predictions have a 96% correlation with the real data, whereas the RF and SHAP XGBoost predictions correlate more weakly with the real data. This strong correlation with the real data set supports the good performance of SVM.

Fig. 6

Taylor diagram of SVM, RF and SHAP XGBoost against the real data set
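A Taylor diagram summarizes three statistics for each model: the correlation coefficient with the observations, the standard deviation of the predictions, and the centered root-mean-square difference. The following minimal sketch computes these quantities and checks the law-of-cosines relation that links them; the `observed` and `predicted` values are made up for illustration and are not data from this study.

```python
from math import sqrt


def taylor_statistics(observed, predicted):
    """Return (correlation, std_obs, std_pred, centered_rmsd)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Anomalies of each series about its own mean.
    dev_o = [o - mean_o for o in observed]
    dev_p = [p - mean_p for p in predicted]
    # Population standard deviations.
    std_o = sqrt(sum(d * d for d in dev_o) / n)
    std_p = sqrt(sum(d * d for d in dev_p) / n)
    # Pearson correlation coefficient.
    corr = sum(a * b for a, b in zip(dev_o, dev_p)) / (n * std_o * std_p)
    # Centered RMS difference (the mean bias is removed).
    crmsd = sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_p)) / n)
    return corr, std_o, std_p, crmsd


observed = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up values
predicted = [1.2, 1.9, 3.3, 3.8, 5.1]  # made-up values
corr, std_o, std_p, crmsd = taylor_statistics(observed, predicted)

# Law-of-cosines identity underlying the diagram's geometry:
# crmsd^2 = std_o^2 + std_p^2 - 2 * std_o * std_p * corr
identity_rhs = std_o**2 + std_p**2 - 2 * std_o * std_p * corr
print(corr, std_o, std_p, crmsd)
```

A model plots close to the reference point when its correlation is high and its standard deviation is close to that of the observations.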

Similarly, the influence of pH on the partial dependence of Se removal varied widely. The overall tendency was an increase with rising pH, with a minor decrease at pH 6 and a maximum at pH 8. Increases in both Fe load and pH were thus inversely or only weakly related to Se removal, consistent with the Pearson correlation results, in which Fe and pH showed relatively weak correlations with Se (Fig. S2). Furthermore, the partial dependence on O (%) initially decreased, reaching its lowest value at 20% O, then rose sharply to a maximum as O increased to 30%, remained stable for a while, and decreased with further increases in O (%).

PDPs with two input features of interest show the interaction between those features. For example, the two-variable PDPs in Fig. 7 show the dependence of Se removal on O (%) together with Fe and temperature. An interaction between the features is clearly visible: Se adsorption is high at increased O (%) content combined with high temperature. At lower temperatures, both temperature and Fe load have an impact on Se removal.

Fig. 7

Two-way partial dependence plots of Se removal on O (%) with Fe and temperature
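A two-way partial dependence surface can be sketched in the same spirit: two features are forced to each pair of grid values and the predictions are averaged over all samples. In the sketch below, `toy_model`, its product term, the feature names and the sample rows are hypothetical and chosen only to make an interaction visible; they do not reproduce the models or data of this study.

```python
from statistics import mean


def toy_model(row):
    """Hypothetical stand-in for a fitted regressor (not the study's model)."""
    # The product term makes the effect of O (%) depend on temperature T.
    return 1.0 + 0.5 * row["pH"] + 0.01 * row["O"] * row["T"]


def partial_dependence_2d(predict, rows, feat_a, grid_a, feat_b, grid_b):
    """Average prediction for every (feat_a, feat_b) pair of grid values."""
    surface = []
    for a in grid_a:
        line = []
        for b in grid_b:
            # Force both features; other features stay as observed.
            preds = [predict({**row, feat_a: a, feat_b: b}) for row in rows]
            line.append(mean(preds))
        surface.append(line)
    return surface


# Made-up samples, for illustration only.
rows = [
    {"pH": 4.0, "O": 10.0, "T": 300.0},
    {"pH": 6.0, "O": 20.0, "T": 500.0},
    {"pH": 8.0, "O": 30.0, "T": 700.0},
]
o_grid = [10.0, 30.0]
t_grid = [300.0, 700.0]
surface = partial_dependence_2d(toy_model, rows, "O", o_grid, "T", t_grid)

# Effect of raising O (%) at low versus high temperature.
gain_low_t = surface[1][0] - surface[0][0]
gain_high_t = surface[1][1] - surface[0][1]
print(gain_low_t, gain_high_t)
```

If the two gains differ, the effect of one feature depends on the level of the other, which is what an interaction in a two-way PDP means.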

In summary, the SVR model was highly effective for predicting Se adsorption, with an R-squared value of 0.98 and an RMSE of 0.14 mg kg−1, indicating potentially higher accuracy than the RF and SHAP models.
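For reference, the two metrics quoted above can be computed as in the following minimal sketch. The `y_true` and `y_pred` values are made up for illustration and are unrelated to the results reported here.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error, in the units of the target variable."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [2.0, 4.0, 6.0, 8.0]  # made-up observations
y_pred = [2.5, 3.5, 6.5, 7.5]  # made-up predictions
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

RMSE is expressed in the units of the target variable, whereas the R-squared value is dimensionless, so the two are usually reported together.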

3.4 Limitations and future outlook of the present study

It is crucial to acknowledge the limitations of using a small data volume in an ML approach for predicting Se adsorption by Fe-modified biochar to ensure the credibility of the research. This section acknowledges these limitations and discusses their potential implications for model performance.

The relatively small data volume used to train the ML model is a limitation of this study. Although the data were carefully selected and curated, the limited availability of data samples on Se adsorption by Fe-modified biochar may affect the performance and generalizability of the model and may introduce biases and uncertainties in its predictive accuracy.

Furthermore, the small data volume may limit the model's ability to capture the full complexity and variability of the Se adsorption process, which can be influenced by various factors such as biochar properties, solution chemistry, and environmental conditions. As a result, the predictive performance of the model may be compromised, and caution should be exercised when interpreting and applying the model's predictions to real-world scenarios.

Despite these limitations, our study provides valuable insights into the potential of Fe-modified biochar for Se adsorption. Future research with larger and more diverse data sets could further improve the accuracy and reliability of the predictive model. Additionally, conducting rigorous validation tests and sensitivity analyses on the model's performance would help to better understand its strengths and limitations. Moreover, surface chemical parameters from XPS analysis, as well as functional group (%) data, could provide valuable information for further study, since such data are insufficiently reported in the relevant literature.

4 Conclusion

The concept of using ML to sort massive amounts of data in order to find important information has recently been applied to environmental remediation, specifically the science-based design of a 'green' carbonaceous and effective functional material (i.e., biochar and Fe-modified biochar) with maximum Se removal capacity. In the present study, ML models, particularly the RF, SVR and SHAP techniques, were used to predict Se adsorption by modified biochar. All three models demonstrated high accuracy and predictive performance for Se removal capacity. The SVR model was particularly effective in predicting Se adsorption, potentially surpassing the RF and SHAP models because of the limited size of the data set. Feature analysis and partial dependence plot analysis revealed that oxygen (%), carbon (%), temperature, pH, and Fe were the most significant factors influencing Se adsorption in all three models, with oxygen (%) being the most significant. The PDP investigation highlighted the influence of each relevant parameter on the target variable, as well as the interactions among these parameters during the Se adsorption process. Researchers could use the relative importance of these factors as guidance for improved Se treatment of real water and wastewater.

The use of ML for environmental remediation, specifically in designing a green carbonaceous material for effective removal of selenium, has potentially positive environmental implications. By using ML to predict Se adsorption by Fe-modified biochar, researchers can optimize the removal of selenium from water and wastewater, which can significantly reduce selenium contamination in the environment.

The application of ML for environmental remediation can also contribute to the development of more efficient and effective technologies for pollution control and prevention. This can help minimize the environmental impact of industrial activities and reduce the amount of harmful pollutants released into the environment.

Overall, the use of ML in environmental remediation has the potential to improve the sustainability of industrial activities and promote the protection and preservation of the environment.