1 Introduction

Heavy metals (HMs) are highly toxic pollutants of water and soil that harm human health and the ecological environment. Various treatment technologies are available for removing HMs from solution, such as adsorption, photocatalysis, electrochemistry, and membrane separation (Hao et al. 2023; Gu et al. 2022; Liu et al. 2023a; Uliana et al. 2021). Biochar is an attractive adsorbent material: it is carbon-rich and is generated via pyrolysis of biomass under oxygen-free or oxygen-limited conditions (Huang et al. 2019). It is also environmentally friendly and cost-efficient and has many favorable physical and chemical properties, such as high surface area, unique pore structure, abundant functional groups, and a stable framework. Hence, biochar is extensively applied to remove HMs from the environment (Liu et al. 2022). Moreover, its adsorption capacity can be improved through various modification techniques, including magnetization, metal impregnation, and plasma treatment, which broaden the utilization of biochar adsorbents (Fang et al. 2023).

The adsorption capacity of biochar for HMs is related to many factors, including the adsorption mechanisms involved (e.g., surface complexation, electrostatic interaction, precipitation), biochar properties (e.g., surface area, functional groups, pore size), and experimental conditions (e.g., time, temperature, concentration) (Qiu et al. 2022). However, because these factors can be combined in so many ways, traditional controlled-variable experiments are not an ideal route for the rapid screening of high-performance biochar. Hence, machine learning, as a popular data-driven research paradigm for exploring complex relationships and determining the features that govern adsorption performance, is warranted (Wei et al. 2023, 2024). We searched the Web of Science for articles published up to 2023 with machine learning and biochar as keywords. The retrieved articles were visualized with CiteSpace software (citespace.podia.com), as shown in Fig. 1. The number of related studies has increased, and more attention has been given to HMs in recent years. Cao et al. (2016) pioneered the use of an artificial neural network (ANN) model and a least squares support vector machine (LSSVM) model to predict biochar yield during the pyrolysis of cattle manure. Although several correlation analyses, meta-analyses, and interpretability analyses have been carried out to evaluate the adsorption capacity of biochar for HMs, machine learning model-based prediction studies are still lacking (Wang et al. 2023). Zheng et al. (2022) and Almalawi et al. (2022) developed hybrid deep learning models to improve the prediction of the adsorption capacity for HMs from various descriptors. Hybrid models combined with optimization algorithms proved to be a better strategy for building adsorption models. Additionally, extensive studies have focused on obtaining deeper insights into the characteristics that dominate adsorption properties (Leng et al. 2022; Sun et al. 2022b).
Although the factors influencing biochar adsorption capacity are complex and diverse, interpretable machine learning helps to link the causes of biochar adsorption to the model predictions. Black-box models (e.g., ANNs and SVMs) are considered to lack transparency and reliability. As an explainable artificial intelligence method, rough set machine learning has been used to improve the interpretability of predicted biochar surface properties (Ang et al. 2023). Model interpretability can also be improved through sensitivity analysis of experimental features (Chen et al. 2022). The most common methods are the partial dependence plot (PDP), local interpretable model-agnostic explanations (LIME, a local surrogate method), and Shapley additive explanations (SHAP), which can be used to determine the importance of input features for adsorption capacity. In turn, these methods add to the interpretability of biochar adsorption mechanisms and are instructive for future development directions. For example, Al-Yaari et al. (2022) used an adaptive network-based fuzzy inference system (ANFIS) model to predict the adsorption removal efficiency of arsenate (As(III)). The statistical parameter R% was used to interpret the relative importance of input features, and the results established that the most dominant features were pH, initial As concentration, and contact time. By comparing four machine learning methods, Da et al. (2022) showed that a multilayer perceptron model with two hidden layers gave the best prediction of the uranium adsorption capacity of biochar. With the permutation feature importance method, they found that the uranium adsorption capacity was highly dependent on the specific surface area (SA), and the optimal range of SA was 500–1200 m² g⁻¹. Zhu et al. (2022) developed models with interpretable random forest algorithms and obtained the dependence of target properties on critical input descriptors via PDP analysis.
The results suggested that iron impregnation increased the C–O and C=O ratios of the iron-biochar composite, which in turn facilitated Cr(VI) removal by the iron-biochar. Nevertheless, existing machine learning research has concentrated on readily available macroscopic properties as descriptors, while more informative adsorbent descriptors (e.g., atomic number, topological structure, molecular fragment, surface area, energy) are still rarely exploited in model building. In addition, the high cost of experiments and the complexity of biochar properties mean that machine learning models are often built with insufficient or low-quality data. In these cases, the prediction results can be biased, non-convergent, unstable, or overfitted, which remains a hurdle for current machine learning models. It has been suggested that surface functional groups are more important than elemental composition for evaluating the adsorption capacity of biochar for HMs. However, that research has several limitations, and the lack of data directly relevant to HM adsorption may lead to uncertainty in the results (Palansooriya et al. 2022). Zhang et al. (2023) comprehensively and systematically reviewed recent works on the application of biochar as an adsorbent for pollutant removal via machine learning. The data scale, the effectiveness of the datasets, and the construction of corresponding databases need to be carefully considered in future research.
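
The permutation feature importance and partial dependence ideas discussed above can be illustrated with a minimal, library-free sketch. The function `predict`, the three feature names, and the synthetic data are hypothetical stand-ins and are not taken from any of the cited studies; in practice, a trained model would take the place of `predict`.

```python
import random

# Hypothetical stand-in for a trained model: capacity depends strongly on
# surface area, weakly on pH, and not at all on a third "dummy" feature.
def predict(row):
    surface_area, ph, dummy = row
    return 0.05 * surface_area + 2.0 * ph + 0.0 * dummy

def mse(X, y):
    return sum((predict(r) - t) ** 2 for r, t in zip(X, y)) / len(y)

def permutation_importance(X, y, col, rng, repeats=20):
    """Mean increase in MSE after shuffling one feature column."""
    base = mse(X, y)
    total = 0.0
    for _ in range(repeats):
        shuffled = [r[col] for r in X]
        rng.shuffle(shuffled)
        Xp = [r[:col] + [v] + r[col + 1:] for r, v in zip(X, shuffled)]
        total += mse(Xp, y) - base
    return total / repeats

def partial_dependence(X, col, grid):
    """Average prediction when feature `col` is forced to each grid value."""
    curve = []
    for g in grid:
        preds = [predict(r[:col] + [g] + r[col + 1:]) for r in X]
        curve.append(sum(preds) / len(preds))
    return curve

rng = random.Random(0)
X = [[rng.uniform(100, 1200), rng.uniform(2, 10), rng.uniform(0, 1)]
     for _ in range(200)]
y = [predict(r) for r in X]  # noise-free targets, so the baseline MSE is 0

importances = [permutation_importance(X, y, c, rng) for c in range(3)]
pdp_sa = partial_dependence(X, 0, [200.0, 600.0, 1000.0])
print(importances, pdp_sa)
```

In this toy setting, shuffling the surface-area column degrades the predictions the most and shuffling the dummy column has no effect, while the partial dependence curve for surface area rises linearly because the stand-in model is linear in that feature.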

Fig. 1

a Year-wise biochar and machine learning publications with the number of citations. b The co-occurrence map of keywords. The node size indicates the frequency of keywords published in articles; the thickness of lines represents collaboration strength; different colors represent different years

2 Descriptors employed in machine learning for data-driven adsorption studies

The first stage in applying machine learning to chemistry research is to represent chemical characteristics and information in a form that machine learning can accept. The selection of suitable input descriptors (features) plays a critical role in improving the accuracy of machine learning prediction models and uncovering the key features that influence adsorbent capacity and selectivity. Thousands of different descriptors have been mined depending on the target application and machine learning algorithm. For example, biochar yield is closely related to biomass characteristics and pyrolysis conditions; thus, the proximate composition and elemental composition of biomass materials can be regarded as input descriptors, and experimental conditions such as temperature, heating rate, and retention time are also considered (Li et al. 2023a). As shown in Fig. 2, descriptors can be classified into different categories. According to the data source, the descriptors can be classified into experimental-based descriptors, theory-guided descriptors, and descriptors that combine experimental data and theoretical calculations. Descriptors can also be divided into qualitative and quantitative descriptors. Qualitative descriptors, also known as molecular fingerprints, encode molecular characteristics using MACCS keys, Morgan fingerprints, Daylight fingerprints, etc. Quantitative descriptors abstract the molecular structure into descriptors by field or graph theory methods. According to the data type, the descriptors can be divided into integers (e.g., atomicity, the number of atoms), real numbers (e.g., molecular weight), and vectors (e.g., dipole moment). According to the dimensionality of the molecular representation required to calculate them, descriptors can be zero-dimensional, one-dimensional, two-dimensional, or three-dimensional.
These descriptors can also be divided into topological, geometric, composition, and molecular property descriptors according to their physical meaning. Given that there are no hard or absolute rules governing this selection, it is necessary to develop and adopt new descriptors with an open attitude, both to enhance the usability of machine learning algorithms and to promote rationalization in the field of chemistry.
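
To make the data-type distinction concrete, the following minimal sketch derives an integer descriptor (atom count), a real-valued descriptor (molecular weight), and a vector descriptor from a chemical formula alone, which is the information available to zero-dimensional descriptors. The helper names and the composition vector are illustrative assumptions rather than descriptors used in the cited studies; dedicated cheminformatics toolkits would be used for fingerprints and higher-dimensional descriptors.

```python
import re

# Standard atomic weights (g/mol) for a few elements common in biochar.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C6H6O' (no brackets)."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def zero_dimensional_descriptors(formula):
    counts = parse_formula(formula)
    # Integer descriptor: total number of atoms.
    n_atoms = sum(counts.values())
    # Real-valued descriptor: molecular weight.
    mol_weight = sum(ATOMIC_WEIGHT[e] * n for e, n in counts.items())
    # Vector descriptor (illustrative): atom fractions in a fixed element order.
    composition = [counts.get(e, 0) / n_atoms for e in sorted(ATOMIC_WEIGHT)]
    return n_atoms, mol_weight, composition

# Phenol is used only as an example molecule.
n_atoms, mol_weight, composition = zero_dimensional_descriptors("C6H6O")
print(n_atoms, round(mol_weight, 3), composition)
```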

Fig. 2

A rough classification of descriptors (in the box) and some corresponding examples (in the circle)

3 Machine learning applications in the removal of HMs by biochar

Machine learning can effectively predict the properties and capacities of biochar, uncover the underlying reaction mechanisms and complex relationships, and help design new materials. For the removal of HMs by biochar, machine learning-aided prediction helps with the rapid innovation and screening of high-performance materials. The experimental remediation ratios and the corresponding predictions of an ANN model are shown in Fig. 3a, and the sensitivity analysis of the ANN model is shown in Fig. 3b. The ANN model performance is acceptable, and the contributions of the input descriptors to the model were compared. The contributions of biochar properties, especially biochar pH, to immobilization efficiency were greater than those of soil physicochemical properties and other factors (Sun et al. 2022a). Indeed, as different machine learning models are used to correlate input descriptors with HM removal by biochar, selecting appropriate descriptors for modelling is a challenge. For example, genetic programming yields good predictions and also yields a simple mathematical expression for the adsorption process, which gives the quantitative relationship between biosorption capacity and input descriptors. The descriptors in that study were grouped into biochar characteristics, biosorption conditions, initial concentration ratio of HMs to biochar, and heavy metal characteristics. Sensitivity analysis showed that the initial concentration ratio of heavy metals to biochar and the carbon content of the biochar were the most influential descriptors (Dashti et al. 2023). As these examples show, sensitivity analysis of machine learning models is commonly used to guide model understanding and to rank parameter importance. Machine learning methods have grown to become powerful tools for uncovering hidden relationships.
The sensitivity of a machine learning model describes how its output varies when the inputs or parameters change, and it is vital to the performance and results of the model. Sensitivity analysis is a popular method for evaluating the behaviour of a model when parameters change. This approach can provide insight into the impact of input parameters on the target; for instance, by calculating the gradient, relevancy factor, partial derivative, or other relevant covariates, one can determine the responses of the model to variations in different parameters. Common methods for evaluating the sensitivity of models include parameter sensitivity analysis, feature importance analysis, local sensitivity analysis (LSA), global sensitivity analysis (GSA), the gradient method, backpropagation sensitivity analysis, and Monte Carlo simulation. Zhao et al. (2021) demonstrated a new approach that applies a kernel extreme learning machine (KELM), kriging models, and local sensitivity analysis to identify the sensitive parameters influencing the adsorption process. LSA usually varies a single input parameter at a time, whereas GSA simultaneously varies all input parameters across their entire ranges. When a global understanding or consideration of the interactions between parameters is needed, GSA methods, such as Sobol indices, can be used to measure the sensitivity over the whole input space (Sun et al. 2022a).
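
As a hedged illustration of two of the quantities named above, the sketch below computes a local, one-at-a-time sensitivity by central finite differences and a relevancy factor, assuming the latter takes its usual Pearson-correlation form between one input feature and the output. The response surface `model` and its three inputs are hypothetical and only stand in for a trained adsorption model.

```python
import math
import random

# Hypothetical response surface standing in for a trained adsorption model.
def model(x):
    ph, dose, temp = x
    return 3.0 * ph - 0.5 * dose ** 2 + 0.1 * temp

def local_sensitivity(f, x0, h=1e-4):
    """One-at-a-time central finite differences around the point x0."""
    grads = []
    for i in range(len(x0)):
        up = list(x0)
        down = list(x0)
        up[i] += h
        down[i] -= h
        grads.append((f(up) - f(down)) / (2 * h))
    return grads

def relevancy_factor(inputs, outputs):
    """Pearson-type correlation between one input feature and the output."""
    mi = sum(inputs) / len(inputs)
    mo = sum(outputs) / len(outputs)
    cov = sum((a - mi) * (b - mo) for a, b in zip(inputs, outputs))
    var_i = sum((a - mi) ** 2 for a in inputs)
    var_o = sum((b - mo) ** 2 for b in outputs)
    return cov / math.sqrt(var_i * var_o)

# Local analysis: derivatives at a single operating point.
grads = local_sensitivity(model, [7.0, 2.0, 25.0])

# Relevancy factors: computed over samples spread across the input ranges.
rng = random.Random(1)
X = [[rng.uniform(2, 10), rng.uniform(0, 4), rng.uniform(15, 45)]
     for _ in range(500)]
y = [model(x) for x in X]
r = [relevancy_factor([x[i] for x in X], y) for i in range(3)]
print(grads, r)
```

The local derivatives describe the model only near the chosen point, whereas the relevancy factors summarize the linear association over the sampled ranges; neither captures interactions between features, which is where global methods such as Sobol indices are useful.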

In addition, for modified biochar, suitable descriptors can be used to investigate adsorption mechanisms and aid engineered biochar design. For example, high-efficiency removal of oxyanions is often difficult to achieve with pristine biochar due to the electrostatic repulsion between the negatively charged biochar surface and the oxyanions. Fe has a strong affinity for As and is an attractive metal for modifying biochar. Biomass characteristics, As species, initial concentrations, and adsorption conditions were used as input descriptors for machine learning modelling. Among them, As species were considered as input descriptors for the first time. However, As species turned out not to be important because As(III) and As(V) have similar removal mechanisms on biochar. Partial dependence plot analysis showed that the As(V) adsorption capacity was positively correlated with the Fe content. Statistical comparisons nevertheless revealed that the role of the Fe content as a direct factor in As adsorption capacity was relatively limited. The possible interactions between As(III), As(V), and Fe-modified biochar through FeOH and FeOH2+ groups may be the dominant mechanism (Liu et al. 2023b). As shown in Fig. 3c, the partial least squares path model (PLS-PM) also quantified the direct and indirect effects of key descriptors on the HM immobilization ratio. The electrical conductivity (ECBC of biochar, ECsoil of soil), cation exchange capacity (CECBC of biochar, CECsoil of soil), organic carbon (OC), and biochar application rate (RateBC) were considered input descriptors. Soil pH and OC content have a direct positive influence on the immobilization of Cd(II). At low soil pH, H+ competes with Cd2+ for adsorption sites and electrostatic repulsion occurs, whereas in an alkaline environment Cd2+ precipitates and reacts readily with abundant oxygen-containing functional groups.
For Zn(II), higher C and N contents favour immobilization, and an increase in the specific surface area (SSA) of biochar provides more active adsorption sites. Pb(II) immobilization can occur through precipitation and cation exchange. The OC content promoted Pb(II) immobilization because surface complexation occurred with C–π and –COO– groups. Similarly, the Cu(II) immobilization ratio should increase with increasing C and N contents because carbon-containing functional groups provide the complexation binding sites. The results of the statistical analysis showed that cation exchange is important for Cu(II) immobilization (Guo et al. 2023). Metals can be divided into cationic metals and anionic metals according to their species, and investigations of the influence of input descriptors on modelling have shown some differences between the two groups.

A meta-analysis, a statistical method for synthesizing the results of multiple studies, was used to explain the immobilization of different anionic metals. As shown in Fig. 3d, the mechanism underlying the immobilization of four anionic metals on biochar in soil mainly includes the following processes: (1) surface complexation of anionic metals with functional groups (e.g., C–O, O=C–O, and C–OH groups) on biochar; (2) electrostatic interactions between the charges on metal ions and adsorbents, which depend on pH and ion speciation; and (3) precipitation or coprecipitation via the synergistic effect of cations and anions (Zhang et al. 2022b). Explanatory variable analysis also indicated that biochar pH and soil pH are the key factors influencing HM immobilization. In addition, Zhang et al. (2022a) used X-ray micro-CT imaging to establish a novel 3D in situ visualization method for Pb(II) adsorption on biochar particles. The images were reconstructed and subsequently segmented via the unsupervised K-means clustering algorithm. Semiquantitative 3D in situ visualization analysis of the rendered images revealed the mechanism of Pb(II) adsorption on the different biochar particles. Coconut shell-activated char had a low adsorption capacity for Pb(II), mainly because its neutral pH limited precipitation and π-electron interactions. Micro-CT showed that the lowest Pb(II) concentration, found in the core of the particles, was closely related to the smallest pore diameter and largest micropore volume. There was sufficient Pb(II) diffusion in the rice husk biochar, which has a thin-walled porous morphology. Based on the typical Crank model, the intraparticle diffusion of the adsorbed species may be described as a function of time and the radial distance from the surface to the centre of a particle.
For the wheat straw biochar, the concentration was the highest in the outer layer of the particles, and the concentration decreased outside of the ellipse. This was attributed to the relatively uniform microstructure. This 3D in situ visualization provides new insight into the adsorption mechanism via image representation.
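
The segmentation step described above can be sketched with a small K-means implementation applied to scalar grey values. This is only an illustration under stated assumptions: the three phases, their grey levels, and the evenly spaced initial centroids are hypothetical, and real micro-CT volumes would be segmented with an established image-analysis library rather than this toy routine.

```python
import random

def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalar intensities; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic initialisation: centroids spread evenly over the range.
    centroids = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        new = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new.append(sum(members) / len(members) if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    return centroids, labels

# Synthetic grey values for three hypothetical phases:
# pore space, carbon matrix, and a Pb-rich region.
rng = random.Random(42)
voxels = ([rng.gauss(20, 3) for _ in range(300)]
          + [rng.gauss(100, 5) for _ in range(300)]
          + [rng.gauss(220, 6) for _ in range(100)])
centroids, labels = kmeans_1d(voxels, k=3)
print([round(c, 1) for c in centroids])
```

The fraction of voxels assigned to the brightest cluster gives a semiquantitative estimate of the Pb-rich volume in this synthetic example.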

Fig. 3

a Prediction of the immobilization efficiency with ANN (Sun et al. 2022a). b Sensitivity analysis of the ANN model (Sun et al. 2022a). c The direct and indirect effects of selected biochar properties, soil properties, and experimental conditions on the immobilization ratio of individual HMs (Cd, Cu, Pb, and Zn), obtained with the PLS-PM (Guo et al. 2023). d Schematic diagram of the immobilization mechanism of biochar on anionic metal(loid)s in the soil (Zhang et al. 2022b)

These studies built machine learning models based on biochar properties, experimental conditions, and pollutant characteristics to predict the HM adsorption capacity of biochar or engineered biochar. The direct and indirect influences of the input descriptors were then revealed via interpretability analysis, which contributes significantly to developing novel viewpoints on material design, mechanism analysis, and process optimization. However, many other potentially important descriptors, such as pHpzc, metal content, surface functional groups, reaction energy, and experimental spectral data, have yet to be examined. The relevant data reported in the published literature are too sparse for constructing a reliable and accurate model. Therefore, for a more comprehensive understanding of the HM interaction mechanism, related experiments and additional feature analyses of the modelling process are urgently needed in further studies. Compared with practical big data problems, the currently collected datasets are relatively sparse, discrete, and noisy, which is a common difficulty at the intersection of machine learning and the physicochemical properties of materials, possibly due to high experimental costs and errors. Another key challenge in machine learning applications is choosing and building a suitable model, especially for small datasets. Although many studies have demonstrated the feasibility of using machine learning algorithms with relatively small datasets, increasing the amount of data is still effective. Chen et al. (2023) utilized data augmentation as a powerful auxiliary modelling tool to compensate for the lack of data and built an optimal random forest (RF) model to predict the characteristics of hydrothermal biochar. Sensitivity analysis was subsequently used to interpret the RF model.
The findings showed that temperature affects the hydrothermal reaction intensity and thereby the mass yield of organic carbon and the total P content in biochar, which are the key features of biochar preparation by hydrothermal carbonization. Compared with traditional machine learning algorithms, the accuracy of biochar property prediction improved on average by 5.8% to 15.8% after data augmentation.

More data and factors can be considered to improve the prediction accuracy of models and provide a clearer understanding of the underlying mechanism. For example, data on additives should not be ignored when modelling the migration of Cr(III) and Cd(II) during pyrolysis (Li et al. 2023b). The addition of biomass waste additives (BWA) increases the carbon and volatile matter contents of sludge and manure, which adversely affects the total concentration (TC) of heavy metals. Different types of inorganic additives (IAs) had different effects on the TC of heavy metals and the retention rate (RR) of Cr(III) in biochar. Ca-based IAs have high thermal stability and remain in the solid phase, which increases biochar yield and thus decreases the Cr(III) content while increasing its retention in biochar. Na- and Al-based IAs both increase Cr(III) content and retention, while K-based IAs have the opposite effect. The TC and RR of Cd(II) decreased with increasing BWA. Due to the low thermal stability of Cd(II), it might transfer from the solid to the gas or liquid phase during pyrolysis through decarboxylation, dehydration, and demethylation. Adding BWA can increase the oxygen and hydrogen contents, thereby accelerating the pyrolysis process, and also decreases the feedstock nitrogen content, which lowers the RR in biochar. In terms of IAs, K-based IAs, e.g., KOH and K2CO3, have a strong binding capacity and produce a high specific surface area. Thus, Cd(II) may be selectively adsorbed and trapped easily by carbon-based materials with K-based IAs. However, not every publication reports all the required information, so it is an enormous challenge to construct a satisfactory database that covers all the relevant factors. Many studies have limitations in real applications, which are related to the quality and quantity of the collected data. Because research methods, research objectives, and experimental conditions vary, the input features selected for a given output target are not fixed.
For example, the efficiency of immobilization is determined based on a wide array of measures, such as the bioavailability, exchangeable fraction, labile fraction, leaching, and water-soluble fraction of HMs. There are also many ways to determine effective HM concentrations in soil, including extraction with deionized water, diethylenetriaminepentaacetic acid, or calcium chloride and the toxicity characteristic leaching procedure. These limitations may cause uncertainty in the prediction results and prevent us from precisely mirroring real-world conditions (Palansooriya et al. 2022). Therefore, studies should emphasize the use of effective datasets in the modelling process and even the construction of a comprehensive database that considers all HMs and important descriptors to improve machine learning models. This approach can provide a full understanding of environmental density during biochar application, which will be more meaningful and realistic.


Herein, we compare the accuracy of our model with that of a previous model (Dashti et al. 2023). Our approach integrates a Gaussian noise-based data augmentation method with the LSSVM model to predict the sorption efficiency of biochar. In general, directly using a small training set to develop a machine learning model may yield unacceptable results in simulations because of overfitting and poor generalization ability. To mitigate this problem, an equal amount of Gaussian noise is sampled from a normal distribution and randomly assigned to the variables of the dataset. The probability density of the noise is as follows:

$$\begin{array}{c}N\left(x\right)=\frac{1}{\sigma \sqrt{2\pi }}\text{exp}\left(-\frac{{\left(x-\mu \right)}^{2}}{2{\sigma }^{2}}\right)\end{array}$$
(1)

where \(\mu\) is the expectation and \(\sigma\) is the standard deviation of the Gaussian distribution (so that its variance is \({\sigma }^{2}\)), from which we can obtain:

$$\begin{array}{c}G\left(x\right)=g\left(x\right)+N\left(x\right)\end{array}$$
(2)

where \(g\left(x\right)\) is the original data distribution, and \(G\left(x\right)\) is the function of augmented data. The experimental and comparable datasets were obtained from the study of Dashti et al. (2023). As shown in Fig. 4a, with 20% of the data as the testing set, adding Gaussian noise to the dataset increased the R² of the testing set from 0.94 to 0.97. In addition, we selected only 30% of the experimental data as the training set, leaving 70% as the testing set. As shown in Fig. 4b, the R² of the LSSVM model on this testing set was 0.88, which increased to 0.93 after Gaussian noise-based data augmentation. As shown in Fig. 4c and d, compared with the distribution of sorption values predicted by the LSSVM alone, the results obtained in combination with Gaussian noise-based data augmentation were more consistent with the experimental data, especially at the extremum points. The data augmentation method uses virtual samples generated from a small dataset to improve the generalization ability and performance of the prediction model, which also helps further analysis to some extent.
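
A minimal, library-free sketch of the idea behind Eqs. (1) and (2) is given below. It is not the implementation used in this study: the RBF kernel, the regularization value `gamma`, the noise level `sigma`, the number of noisy copies, and the one-dimensional toy dataset are all assumptions chosen for illustration, and the noise is added to the input variables only. The LSSVM is fitted in its standard form, by solving a single linear system for the bias `b` and the coefficients `alpha`.

```python
import math
import random

def rbf(a, b, width=1.0):
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2 * width ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def lssvm_fit(X, y, gamma=100.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf(X[i], X[j]) + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]

def lssvm_predict(X_train, b, alpha, x):
    return b + sum(a * rbf(xi, x) for a, xi in zip(alpha, X_train))

def augment(X, y, rng, copies=3, sigma=0.05):
    """Append noisy copies of each sample: G(x) = g(x) + N(0, sigma^2)."""
    Xa, ya = [r[:] for r in X], list(y)
    for _ in range(copies):
        for r, t in zip(X, y):
            Xa.append([v + rng.gauss(0.0, sigma) for v in r])
            ya.append(t)
    return Xa, ya

def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / sum((t - mean) ** 2 for t in y_true)

# Toy one-dimensional data standing in for a small experimental dataset.
X_train = [[0.25 * i] for i in range(13)]
y_train = [math.sin(2 * x[0]) for x in X_train]
X_test = [[0.1 * i] for i in range(31)]
y_test = [math.sin(2 * x[0]) for x in X_test]

rng = random.Random(0)
X_aug, y_aug = augment(X_train, y_train, rng)
b, alpha = lssvm_fit(X_aug, y_aug)
score = r2(y_test, [lssvm_predict(X_aug, b, alpha, x) for x in X_test])
print(round(score, 3))
```

In practice, the features would be rescaled before noise is added so that a single `sigma` is meaningful for all variables, and the noise level would be tuned on a validation set.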

Fig. 4

Predicted vs. experimental sorption efficiency with a 20% and b 70% of data as testing set. Boxplot of sorption efficiency on c 20% and d 70% of data as testing set calculated from the experiment (red), LSSVM (green), and LSSVM with noise-based data augmentation (blue) respectively

The relevancy factor (r) was also calculated as part of the sensitivity analysis, which provides a quantitative measure of the importance of the input features for the output labels. The relevancy factors were calculated from the predicted values. As shown in Fig. 5a, the larger the absolute value of r, the more important the corresponding feature is for determining the sorption efficiency. The feature importance obtained from our model predictions was consistent with that obtained from the experimental data (Dashti et al. 2023). Next, we discuss the Sobol method, which was used to assess the stability of the model on the testing data. The Sobol sensitivity indices can be divided into first-order, second-order, higher-order, and total-order sensitivity indices. The total-order sensitivity indices are appropriate for evaluating the full range of the feature space and the influence of different disturbance values on the output label values. In this study, minimum and maximum feature values were set after rescaling to [0, 1]. In Fig. 5b, the sensitivity indices quantify the contribution of the features and their interactions to the overall model output. The results indicate that the model had low sensitivity to variations in features within a limited range, and the stability of the model with data augmentation was thereby verified.
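
For readers unfamiliar with Sobol indices, the sketch below estimates first-order and total-order indices by Monte Carlo sampling on the unit hypercube, matching the [0, 1] rescaling mentioned above. The estimators are the commonly used Saltelli (first-order) and Jansen (total-order) forms; the additive test function `model` and the sample size are assumptions chosen so that the exact indices are known, and the sketch is not the procedure used to produce Fig. 5b.

```python
import random

def sobol_indices(f, dim, n, rng):
    """First-order (Saltelli) and total-order (Jansen) estimates on [0, 1]^dim."""
    A = [[rng.random() for _ in range(dim)] for _ in range(n)]
    B = [[rng.random() for _ in range(dim)] for _ in range(n)]
    fA = [f(x) for x in A]
    fB = [f(x) for x in B]
    all_y = fA + fB
    mean = sum(all_y) / len(all_y)
    var = sum((v - mean) ** 2 for v in all_y) / len(all_y)
    first, total = [], []
    for i in range(dim):
        # A_B^i: rows of A with column i taken from B.
        fABi = [f(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # Centring f(B) by the mean reduces the variance of the estimate.
        s1 = sum((yb - mean) * (yab - ya)
                 for ya, yb, yab in zip(fA, fB, fABi)) / n
        st = sum((ya - yab) ** 2 for ya, yab in zip(fA, fABi)) / (2 * n)
        first.append(s1 / var)
        total.append(st / var)
    return first, total

# Additive test function: the exact indices are a_i^2 / sum(a_j^2),
# i.e. 16/21, 4/21 and 1/21, and first-order equals total-order.
def model(x):
    return 4.0 * x[0] + 2.0 * x[1] + 1.0 * x[2]

first, total = sobol_indices(model, dim=3, n=20000, rng=random.Random(0))
print([round(s, 2) for s in first], [round(s, 2) for s in total])
```

For a model with interactions, the total-order index of a feature exceeds its first-order index, and the gap measures how much of the output variance that feature contributes through interactions.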

Fig. 5

a Effects of input features on sorption efficiency prediction using LSSVM with data augmentation. b Sobol sensitivity indices of input features

4 Conclusion, perspectives and challenges

In conclusion, the latest applications of machine learning as an advanced tool for determining the adsorption performance of biochar are summarized. Sensitivity and interpretability analyses of models have become a major pursuit of researchers. Thus, there is a demand for novel and efficient descriptors, high-quality databases, and practical techniques to accelerate intelligent experimental control. Based on the related literature, the following aspects deserve special mention as perspectives and challenges for promoting the operational application of machine learning in the removal of HMs by biochar in the environment and for narrowing the knowledge gaps between different disciplines:

  1. The input feature space should be expanded to obtain more closely correlated or more important properties; some studies have adopted graph data as molecular representations and combined them with deep learning methods. In addition, the removal efficiency of HMs by biochar depends heavily on the molecular structure characteristics of the biochar, yet there is little related research and discussion. In future research, more input descriptors need to be explored in detail.

  2. Machine learning is essentially a statistical method that has certain requirements for the amount of data, and deep learning in particular demands more data than traditional machine learning. Using active learning, Bayesian optimization, or other algorithms to obtain more valuable data is effective; for example, in this study, combining the model with Gaussian noise-based data augmentation improved its accuracy and generalizability.

  3. The application of machine learning techniques to biochar cultivation is scalable and practical, but several challenges remain in terms of building and normalizing design databases. Standardization of the data format and data conventions is a key avenue for increasing data accessibility to researchers and benefiting transdisciplinary applications. Although significant obstacles remain to be overcome, datasets with comprehensive and unified information are expected to be obtained through either the addition of generalized descriptors or the combination with quantum computing.

  4. Neural network algorithms are extensively applied in materials science, but a notable challenge lies in their inherent lack of interpretability. This lack of transparency hinders the understanding of the underlying relationships between input features and output predictions, limiting the trust in and adoption of these models in critical decision-making processes within materials science. Addressing the interpretability gap of the existing models remains a crucial area of research, and it is also necessary and urgent to exploit grey-box or white-box models with high interpretability to promote the acquisition of clear physicochemical laws.

  5. Impacted by the big data-driven “fourth paradigm”, the application of machine learning in the optimization of HM removal processes is both an opportunity and a challenge. It is necessary to concentrate on mitigating the existing constraints through further research and enable more researchers to utilize biochar materials efficiently to protect the environment.