1 Introduction

Biochar is a stable form of charcoal produced by pyrolyzing biomass in a low-oxygen environment (Pan et al. 2019). It offers various benefits, including enhanced soil fertility (Yang et al. 2019), improved water retention (Hussain et al. 2020), increased crop yields (Chen et al. 2022), reduced greenhouse gas emissions (Zhang et al. 2024), carbon sequestration from the atmosphere (El-Naggar et al. 2018), and contaminant removal (Liu et al. 2022). The porous structure, high surface area, and catalytic ability of biochar provide abundant sorption sites and facilitate chemical reactions, rendering biochar effective in pollutant sequestration, soil remediation, and diverse catalytic processes (Liang et al. 2021). The performance of biochar, such as its catalytic ability and adsorption capacity, is influenced by a combination of factors including its physical characteristics (e.g., porosity and surface area), chemical attributes (e.g., pH and composition), and the specific origin of feedstock (Yuan et al. 2022). During biochar production, the cleavage of covalent bonds in organic macromolecules or electron transfer between free radical precursors and metals in biomass produces a new class of species in biochar, known as persistent free radicals (PFRs) (Liao et al. 2014; Ruan et al. 2019). Unlike traditionally recognized short-lived free radicals, PFRs demonstrate a prolonged lifespan ranging from hours to months (Pan et al. 2019). This extended lifespan enables PFRs to remain active in the environment, resulting in more enduring and significant effects (Fang et al. 2014).

PFRs exhibit both positive and negative effects across various biochar applications. Positively, PFRs transfer electrons to stable oxidants (e.g., O2, H2O2, and persulfate), thereby generating diverse reactive oxygen species (ROS) crucial for the degradation of organic pollutants (Fang et al. 2014, 2017; Qin et al. 2017). Furthermore, PFRs effectively facilitate the reductive or oxidative transformation of heavy metals, like Cr(VI) (Dong et al. 2014; Zhu et al. 2023) and As(III) (Dong et al. 2014), by macromolecular free radicals without oxidants (Qiu et al. 2023). Conversely, owing to their stability and persistence, PFRs can induce adverse effects by prolonging interactions. These radicals have the potential to trigger ROS production, hindering seed germination and bud growth in crops (Liao et al. 2014). Furthermore, they can induce oxidative stress in human organisms, resulting in ROS production and subsequent molecular oxidation of tissues, as well as DNA damage (Chuang et al. 2017; Liu et al. 2023). Hence, in contexts where the primary objective is pollutant remediation, an elevated presence of PFRs proves advantageous due to their efficacy in long-term remediation efforts. However, when biochar is utilized as a soil amendment to support crop cultivation, a reduced content of PFRs is preferable. Therefore, predicting PFR levels in biochar before its preparation is of great importance, and the target level should be tailored to specific application requirements.

Although several studies have identified key factors affecting PFR formation in biochar, including feedstock (Wang et al. 2022), pyrolysis temperature (Zhang et al. 2022; Prasertpong et al. 2023), and substituted aromatics (Wang et al. 2022), their prediction remains challenging due to the involvement of highly variable synthesis methods, diverse precursors, and complex reaction processes. In recent years, data-driven techniques, notably machine learning (ML) algorithms such as random forests (RF) and support vector machines (SVM), and deep learning (DL) approaches like the multilayer perceptron (MLP), have gained significant attention for constructing predictive models across various domains of environmental science (Zhong et al. 2021), including contaminant monitoring (Gao et al. 2021; Ullah et al. 2023), micropollutant oxidation (Cha et al. 2020), and the design of new materials (Tang et al. 2020). Recently, the scope of ML predictions has expanded to encompass diverse applications within biochar research, such as forecasting of micronutrients (Ullah et al. 2023), heavy metal immobilization and migration (Li et al. 2023), wastewater treatment (Kanthasamy et al. 2023), antibiotic adsorption (Zhang et al. 2023a), biochar functioning as a biocatalyst (Wang et al. 2023), and its impacts on GHG emissions (Han et al. 2024). These algorithms offer distinct advantages over traditional statistical methods by capturing nonlinear and complex relationships between features and target variables (Zhu et al. 2020). For example, RF utilizes recursive binary splitting of data to minimize the residual sum of squares, providing advantages such as the ability to make local and global predictions, non-biased weights, and effective management of imbalanced and small datasets (Golden et al. 2019; Konstantinov and Utkin 2021).
SVM employs hyperplanes to separate classes and maximize the margin, making it suitable for handling complex relationships and high-dimensional data (Hussain 2019). MLP is an artificial neural network architecture comprising interconnected nodes that process input data and learn complex patterns (Bihl et al. 2023). In parallel with the adoption of ML algorithms for predictive modeling, the integration of graphical user interfaces (GUIs) has become pivotal. GUIs offer intuitive platforms for researchers to interact with complex ML algorithms, streamlining model development, parameter tuning, and result visualization (Suthers et al. 2021). This enhanced accessibility promotes broader adoption of ML techniques, empowering researchers to tackle complex phenomena like PFR prediction in biochar with greater efficiency and accuracy.

To address the dearth of studies on predicting PFRs in biochar, we developed five ML-based regression and classification models: RF, extreme gradient boosting (XGBoost), light gradient boosting machines (LGBM), SVM, and MLP. The primary objective of this study is to elucidate the systematic application of ML tools for predictive analytics and their utilization in extracting valuable insights into the formation process of PFRs during biochar preparation across different feedstocks. A data-driven approach was applied and is illustrated in Supporting Information Fig. S1. Firstly, data collection and preprocessing were conducted to ensure data quality and suitability for analysis. Secondly, a comprehensive descriptive statistical analysis was performed, and the Spearman correlation coefficient (SCC) was calculated to identify potential relationships between variables. Thirdly, both regression and classification algorithms were developed and compared to predict PFRs. Feature analysis based on the results of predictive ML models was then carried out. Lastly, a Graphical User Interface (GUI) was developed to enhance accessibility in predicting PFRs in biochar.

2 Methodology

2.1 Data collection, imputation, preprocessing, and correlation analysis

For the data collection process, a comprehensive review of the literature was performed based on published articles in Web of Science and Scopus databases using the keywords of “Biochar” AND “EPFR”, “Biochar” AND “Environmentally persistent free radicals”, “Biochar” AND “PFR”, “Biochar” AND “Persistent free radicals”. The keyword search returned over 110 experimental works, which were then manually screened by including all the data for the two empirical categories (1) biochar and (2) PFRs properties, including 9 features and the 2 predictors. Therefore, 30 articles were shortlisted and deemed suitable for data collection relevant to the study (Table S1). All screened data were initially accepted impartially, without any initial judgment or bias regarding the data validity. It is imperative to highlight that a substantial portion of our dataset originated from assays utilizing the DPPH standard coupled with EPR detection. Regarding other standards, such as Cr3+ diamond, their utilization in preceding studies is limited, with only one study identified employing Cr3+ for PFRs detection. All the data points were carefully marked off to avoid duplicate or multiple entries. For the articles whose data were not directly listed in tables or as text, the open-source WebPlotDigitizer software was used to obtain the necessary data from their figures.

The detailed procedure of ML exploration associated with PFRs in biochar is illustrated in Fig. S1. Ten parameters were identified from biochar properties as input features, including pyrolysis temperature (PT, °C), specific surface area (SSA, m2 g−1), doping index (DI), feedstock (FS), retention time (RT, min), pH, carbon content (C%, wt %), oxygen content (O%, wt %), hydrogen content (H%, wt %), and nitrogen content (N%, wt %), and 2 parameters were defined as output predictors, including the content (10^15 spins g−1) and types (g-factor) of PFRs. In the classification of PFR types, three distinct g-factor classes exist: g1 denotes carbon-centered radicals with values below 2.0030, g2 corresponds to carbon-centered radicals with adjacent oxygen with values in the range of 2.0030–2.0040, and g3 designates oxygen-centered radicals with g-factors exceeding 2.0040 (Tian et al. 2009; Ruan et al. 2019). Among these, g1 PFRs exhibit more stability and reduced reactivity than g3, attributed to the lower electronegativity of carbon atoms than oxygen (Dellinger et al. 2007). From the total dataset, 11, 9, and 170 data points were missing for the SSA, C%, and pH data, respectively. To fill these data gaps and ensure a consistent dataset, an ML model was employed to predict the missing values of SSA and C% using other features, following the methodology provided by Yuan et al. (2021). In summary, highly correlated features were identified by SCC analysis (as shown in Fig. 1i), which included SSA, C%, and PT. These features were utilized to impute missing values using the RF model. The RF model was trained using available data points, while missing data points were treated as testing data and subsequently predicted based on the trained model. The feature pH was excluded from the dataset because too few data points were available (Palansooriya et al. 2022).
Finally, 9 input features and 2 output predictors, comprising 253 data points were obtained and used for machine learning exploration (Table S2).
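The three g-factor classes defined above can be expressed as a simple labeling rule. The following is a minimal sketch, not the authors' code; the function name `classify_g_factor` is hypothetical, and assigning the boundary values 2.0030 and 2.0040 to g2 is an assumption based on the stated range of 2.0030–2.0040.

```python
def classify_g_factor(g: float) -> str:
    """Map an EPR g-factor to the PFR class described in the text."""
    if g < 2.0030:
        return "g1"  # carbon-centered radicals
    if g <= 2.0040:
        return "g2"  # carbon-centered radicals with adjacent oxygen (boundaries assumed inclusive)
    return "g3"      # oxygen-centered radicals


# Illustrative g-factor values only (not taken from the dataset tables).
samples = [2.0020, 2.0033, 2.0049]
labels = [classify_g_factor(g) for g in samples]
print(labels)
```

Such a rule turns the continuous g-factor into the categorical target that the classification models are trained on.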

Fig. 1

Box-normal plots (a-f) representing the descriptive statistics of the data for each variable collected from the literature. Bar plots (g, h) represent the frequency of categorical features. The Spearman correlation coefficient matrix (i) represents the correlations among input features. Most of the features showed a normal distribution of data, except for PFRs content (spins) and N%, where a subtle left skew and stronger left skew were observed, respectively. The abbreviated features, along with their respective units, are as follows: PT: pyrolysis temperature (°C), SSA: specific surface area (m2 g−1), C%: carbon content (wt %), O%: oxygen content (wt %), H%: hydrogen content (wt %), N%: nitrogen content (wt %), DI: doping index, and FS: feedstock

To rectify the imbalance within the dataset, specifically with respect to the DI feature, an up-sampling technique was implemented (detailed information is available in Text S1; Fernández et al. 2018). Subsequently, to enhance the efficiency of the ML models during training for rapid convergence, the input features underwent encoding and normalization utilizing Scikit-Learn. Our training dataset comprises 2 categorical and 9 numerical features. It is necessary to convert categorical features into numerical variables for the interpretation of ML models. Hence, we utilized LabelEncoder for DI, and OrdinalEncoder for FS (McGinnis et al. 2018). Numerical features were uniformly scaled using StandardScaler to obtain a similar scale and approximation of a normal distribution.
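The encoding and scaling steps can be sketched in plain Python. This is an illustrative stand-in for Scikit-Learn's encoders and StandardScaler, not the authors' pipeline: the helper names `label_encode` and `standardize` are hypothetical, the PT values are invented, and the sorted-label integer mapping is assumed to mirror the library's default behavior.

```python
from statistics import mean, pstdev


def label_encode(values):
    """Map each distinct category to an integer, following the sorted order of the labels."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def standardize(values):
    """Z-score scaling: subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Feedstock categories as named in the text; PT values are hypothetical.
fs = ["WLCB", "NLCB", "CP", "NWLCB", "WLCB"]
pt = [300.0, 400.0, 500.0, 600.0, 700.0]

fs_codes, fs_mapping = label_encode(fs)
pt_scaled = standardize(pt)
print(fs_mapping)
print([round(v, 3) for v in pt_scaled])
```

After scaling, each numerical feature has zero mean and unit variance, so features measured in very different units (e.g., °C and wt %) contribute on a comparable scale.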

Descriptive statistical analysis of input and target features was conducted, and the SCC was used to investigate the correlation among input features. SCC values fall in the range of −1 to 1, where 0 indicates no monotonic correlation, and a high negative or positive value indicates a strong negative or positive correlation, respectively.
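For illustration, the SCC can be computed as the Pearson correlation of the ranked values, with tied values sharing their average rank. The sketch below is self-contained and uses invented PT, SSA, and H% values; the function names are hypothetical, and in practice a library routine would normally be used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: SSA rises nonlinearly with PT, H% falls with PT.
pt = [200, 300, 400, 500, 600, 700]
ssa = [2.0, 5.0, 20.0, 80.0, 150.0, 230.0]
h = [6.5, 5.8, 4.9, 3.1, 2.2, 1.4]
print(round(spearman(pt, ssa), 3), round(spearman(pt, h), 3))
```

Because only the ordering of values matters, a strictly increasing but nonlinear relationship still yields an SCC of 1, which is why the coefficient is read as a measure of monotonic rather than linear association.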

2.2 Model construction and hyperparameter optimization

Five ML algorithms (namely RF, XGBoost, SVM, LGBM, and MLP) were compared and evaluated for predicting the contents and types of PFRs in biochar, as regressor and classifier ML models. These ML models have been proven to be suitable and successful for midsize datasets (Zhu et al. 2019; Li et al. 2020, 2021). A description of the target features used in the ML models for PFRs prediction in biochar is available in Text S2.

The entire dataset, containing 253 data points, was subjected to multi-training by splitting it into randomly chosen training and testing subsets. Thus, 85% of the total data points were randomly selected and labeled as training data, and the remaining 15% were labeled as test data for the final evaluation of the developed models. During the training phase, we adjusted the hyperparameters of each algorithm to minimize the mean-squared error for PFRs prediction using five-fold cross-validation. Furthermore, to ensure the robustness of the model, we conducted a Y-scrambling analysis (Moore et al. 2022). Each algorithm was tuned with its own set of hyperparameters. For instance, ensemble models (RF, XGBoost, LGBM) required fine-tuning of parameters including the number of trees, depth of each tree, and max_features. SVM hyperparameters, including epsilon (ε), kernel function, and penalty (α), were also optimized. Additionally, MLP configurations involved fine-tuning of parameters like hidden layer sizes, activation function, and learning rate, to improve model convergence (Table S3; Palansooriya et al. 2022). To explore these crucial parameters comprehensively, we utilized a param_grid containing various combinations of hyperparameters. GridSearchCV was then employed to systematically search through all combinations of these hyperparameters using cross-validation, aiming to identify the optimal combination that maximizes the specified scoring metric.
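The split-then-search procedure described above can be sketched without external libraries. The example below is a scaled-down illustration and not the authors' code: it swaps in a one-dimensional k-nearest-neighbour regressor for the tree ensembles, uses synthetic data, and all function names, seeds, and grid values are assumptions. It keeps the same logic as an exhaustive grid search: an 85/15 split, five-fold cross-validation on the training part, and selection of the parameter combination with the lowest mean squared error.

```python
import itertools
import random


def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = nearest[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_mse(xs, ys, params, k=5, seed=1):
    """Mean squared error averaged over k cross-validation folds."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k, random.Random(seed)):
        held_out = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in held_out]
        tr_y = [y for i, y in enumerate(ys) if i not in held_out]
        sq = [(knn_predict(tr_x, tr_y, xs[i], **params) - ys[i]) ** 2 for i in fold]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / k


# Synthetic stand-in data with the same size as the dataset (253 points).
rng = random.Random(0)
xs = [rng.uniform(200.0, 700.0) for _ in range(253)]
ys = [0.01 * x + rng.gauss(0.0, 0.5) for x in xs]

# 85/15 random split into training and test subsets.
indices = list(range(len(xs)))
rng.shuffle(indices)
n_train = int(0.85 * len(xs))
train_idx, test_idx = indices[:n_train], indices[n_train:]
train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]

# Exhaustive search over every combination in the (hypothetical) grid.
param_grid = {"n_neighbors": [1, 3, 5, 9, 15]}
best_params, best_score = None, float("inf")
for combo in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid, combo))
    score = cv_mse(train_x, train_y, params)
    if score < best_score:
        best_params, best_score = params, score
print(best_params, round(best_score, 3))
```

The test subset is never touched during the search, so it remains available for the final, unbiased evaluation of the selected configuration.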

2.3 Performance evaluation of regression and classification models

The assessment of regression model performance relied on metrics such as the coefficient of determination (R2) and the root mean square error (RMSE) (Zhu et al. 2019; Hu et al. 2022). R2 and RMSE values were calculated using Eqs. (1) and (2), respectively.

$${R}^{2}=1-\frac{{\sum }_{n=1}^{N} (\hat{{\text{y}} }-y{)}^{2}}{{\sum }_{n=1}^{N} (y-\bar{y}{)}^{2}}$$
(1)
$$RMSE=\sqrt{\frac{{\sum }_{n=1}^{N} (\hat{{\text{y}} }-y{)}^{2}}{N}}$$
(2)

where ŷ, y, and ȳ are the predicted, actual, and mean values of the target feature, respectively, n is the index of a data point, and N is the total number of data points.
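As a worked illustration of these two metrics, the sketch below implements the standard definitions, with the denominator of R2 taken as the total sum of squares of the actual values about their mean. The function names and the four data points are hypothetical.

```python
from math import sqrt


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    return sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical actual and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(round(r_squared(y_true, y_pred), 3), round(rmse(y_true, y_pred), 3))
```

Here the residual sum of squares is 0.5 and the total sum of squares is 5.0, giving R2 = 0.9 and RMSE = sqrt(0.125) ≈ 0.354.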

The performance of classification models was evaluated by AUROC and Confusion Matrix (CM), which are described in Text S3. In the ROC curve, the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis were calculated through Eqs. (3) and (4), respectively.

$$TPR =\frac{TP}{TP+FN} \times 100 \left(\%\right)$$
(3)
$$FPR =\frac{FP}{TN+FP} \times 100 \left(\%\right)$$
(4)

2.4 Model interpretation and feature importance

The ML model was utilized to investigate the significance and influence of each feature on the target features related to PFRs. Feature analysis was conducted using the SHapley Additive exPlanations (SHAP) method, a widely employed technique in feature analysis (Text S4; Lundberg et al. 2018; Onsree et al. 2022; Prasertpong et al. 2023). All modeling tasks were executed in Python (version 3.09), utilizing open-source libraries such as Pandas and NumPy for data processing, Scikit-learn for encoding, scaling, handling the dataset’s imbalanced nature, and imputation of missing values, seaborn and matplotlib for visual representation, and SHAP for model interpretation and feature importance assessment.
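SHAP attributions are built on Shapley values, which average the marginal contribution of each feature over all orders in which features can be added. The sketch below computes exact Shapley values for a three-feature toy function by enumerating permutations; it is not the SHAP library and not the authors' model. The toy function, its coefficients, and the use of a single baseline point (rather than a background dataset) are assumptions made for illustration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to its actual value
            updated = model(current)
            phi[i] += updated - previous
            previous = updated
    return [p / len(orders) for p in phi]


def toy_model(features):
    """Hypothetical response with one interaction term (PT x SSA)."""
    pt, c, ssa = features
    return 0.5 * pt + 2.0 * c + 0.1 * pt * ssa


x = [4.0, 3.0, 2.0]
baseline = [1.0, 1.0, 1.0]
phi = shapley_values(toy_model, x, baseline)
print([round(p, 3) for p in phi])
```

The attributions sum to the difference between the prediction at x and at the baseline (the efficiency property), and the interaction term is shared between the two features involved. Exact enumeration scales factorially with the number of features, which is why tree-specific algorithms are used in practice.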

3 Results and discussion

3.1 Descriptive statistics and correlation analysis

The raw data, comprising 253 data points, underwent a descriptive analysis to obtain preliminary insights into all input features and target variables. This analysis involved computing the minimum, maximum, and average values for numerical features along with frequency counts for categorical features. The contents and types of PFRs are usually determined by the FS, PT, SSA, and other properties of biochar (Yuan et al. 2022). Concerning output variables, the content of PFRs based on literature data ranged from a minimum of 0.103 × 10^15 spins g−1 to a maximum of 5210 × 10^15 spins g−1, with a mean and standard deviation (SD) of 330 × 10^15 spins g−1 and 738 × 10^15 spins g−1, respectively (Fig. 1a). The g-factor of PFRs in the literature varied from 2.0020 to 2.0049 with a mean value of 2.0033 and an SD of 0.00060 (Fig. 1b). For the input variables, the reported values of PT in the literature varied from 200°C to 700°C (Fig. 1c). The SSA ranged from 1.65 m2 g−1 to 231 m2 g−1 with a mean value of 86.5 m2 g−1 and an SD of 56 m2 g−1 (Fig. 1d). The mean and SD values for C%, O%, H%, and N% were 49.5 ± 20.2, 32.0 ± 11.0, 6.0 ± 2.0, and 1.95 ± 1.55 wt %, respectively (Fig. 1e and f). The categorical features FS and DI underwent ordinal and label encoding, respectively. The total value counts for biochar with doping and without doping were 74 and 179, respectively (Fig. 1g). Meanwhile, FS value counts from the literature for woody lignocellulosic biochar (WLCB), non-woody lignocellulosic biochar (NWLCB), non-lignocellulosic biochar (NLCB), and co-pyrolysis of different feedstock (CP) were 71, 74, 77, and 31, respectively (Fig. 1h).

The SCC was utilized to examine the relationships among features (Fig. 1i), including the two categorical features. Strong positive correlations (SCC values > 0.75) were observed between PT and SSA and between SSA and C%, while negative correlations were observed between C% and H% (−0.58), PT and O% (−0.53), and PT and H% (−0.51). Generally, higher PT results in biochar characterized by increased SSA and reduced atomic ratios of hydrogen to carbon (H/C) and oxygen to carbon (O/C) (Xiao et al. 2016; Wang et al. 2022). These changes are attributed to enhanced porosity, dehydration and devolatilization, carbonization, and reduction of functional groups at elevated temperatures (Bushra and Remya 2020; Yang et al. 2023). The decomposition of volatile compounds leads to a higher carbon proportion relative to hydrogen and oxygen in the biochar. The mixed pattern of correlations identified among the various input features supports their retention in building an effective predictive model, as each feature contributes independently to the model’s predictive capacity.

3.2 Development of regression and classification predictive models

Five ML models, namely RF, XGBoost, LGBM, SVR, and MLP, were developed and evaluated for their capability to predict the content and types of PFRs in the dataset using the input features described previously in the data preprocessing section. Table 1 presents a comparison of the performance of the ML models. The tree-based models demonstrated comparable performance for predicting PFRs contents (spins g−1), with XGBoost and RF achieving the highest R2 values of 0.98, closely followed by LGBM with an R2 of 0.95. However, SVR and MLP exhibited relatively lower R2 values of 0.78 and 0.87, respectively (Fig. S2 and Table 1). In terms of predicting PFR types (g-factor), XGBoost exhibited the highest performance with an accuracy of 0.92, followed by LGBM and RF with accuracies of 0.90 (Fig. S3 and Table 1). The MLP and SVM models exhibited slightly lower accuracies of 0.87 and 0.86, respectively. Notably, the SVM-based and MLP models achieved relatively lower R2 and accuracy values for PFR prediction (Table 1). This disparity in performance could be attributed to the small and imbalanced dataset used in the study, which may not fully capture the intricacies of the underlying patterns and relationships in the target population, particularly when certain classes are underrepresented (Jordan and Mitchell 2015). Previous studies have highlighted that ensemble models, such as XGBoost, tend to perform better on smaller datasets due to their ability to handle such complexities while being less sensitive to overfitting (Golden et al. 2019; Zhu et al. 2020), whereas SVM and MLP provide optimal performance for large datasets (Padarian et al. 2019).

Table 1 Comparative evaluation of regression and classification models

XGBoost emerged as the most proficient predictive model for both PFR content and types. To assess its feasibility on the test dataset, its performance is visually depicted in Fig. 2, which encompasses data from 77 samples. In Fig. 2a, a joint scatterplot illustrates the relationship between the actual and predicted values of PFR content in biochar. The XGBoost model exhibits strong predictive capabilities for PFRs in biochar, as evidenced by the test R2 value of 0.95, highlighting its robust generalization to unseen data. Figure 2b illustrates a close match between actual values and model predictions, indicating that the XGBoost model accurately captures the underlying relationships between biochar properties and the content of PFRs. It should be noted that there are no evident signs of overfitting or underfitting in Fig. 2b, as the predicted values, represented by the blue dotted line, follow a smooth trajectory without significant deviations from the actual data points, implying a balanced model complexity (Ma et al. 2023).

Fig. 2

Prediction performance of the XGBoost model on the dataset. a Joint scatter plot with marginal histograms on the X and Y axes of the actual vs. predicted values of PFRs content (10^15 spins g−1); blue shades represent 95% confidence intervals of the regression line on the test points. b Scatter plot of actual vs. predicted values of the train vs. test dataset of PFRs content (10^15 spins g−1). c Area under the receiver operating characteristic curve (AUROC) for the multiclass prediction of the types of PFRs (g-factor); solid lines and gray shade represent the precision-recall curves and their 95% confidence intervals for each class of g-factor. d Confusion matrix of true vs. predicted labels of g-factor

For the prediction of PFRs types, Fig. 2c displays the AUROC plot of the XGBoost classifier model's performance for each class of PFRs (g1, g2, and g3). The ROC curve for g1 achieved an AUC of 0.92, while g2 and g3 had AUCs of 0.89 and 0.98, respectively. Moreover, the mean AUC was 0.92, averaged across all three labels. In Fig. 2d, the XGBoost model's classification performance was evaluated on three distinct labels: g1, g2, and g3. The confusion matrix analysis revealed accurate predictions for 22 instances of g1, with 6 and 0 instances misclassified as g2 and g3, respectively. For g2, the model correctly predicted 23 samples, but it misclassified 4 as g1 and 3 as g3. Similarly, for g3, the model demonstrated strong predictive abilities, accurately classifying 12 samples while misclassifying 1 sample each as g1 and g2. The combined results from the AUROC and CM plots confirm the effectiveness of the XGBoost model in classifying PFRs into their respective classes.
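To make the link between the confusion matrix and Eqs. (3) and (4) concrete, the sketch below computes one-vs-rest TPR and FPR for each class from the counts reported above. Reading the matrix with true classes as rows and predicted classes as columns is an assumption based on the wording of the text, and the function name is hypothetical.

```python
labels = ["g1", "g2", "g3"]

# Rows: true class; columns: predicted class (counts as reported in the text).
cm = [
    [22, 6, 0],
    [4, 23, 3],
    [1, 1, 12],
]


def one_vs_rest_rates(matrix, k):
    """Return (TPR %, FPR %) for class k, treating all other classes as negatives."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    tpr = tp / (tp + fn) * 100
    fpr = fp / (tn + fp) * 100
    return tpr, fpr


rates = {name: one_vs_rest_rates(cm, k) for k, name in enumerate(labels)}
for name, (tpr, fpr) in rates.items():
    print(f"{name}: TPR = {tpr:.1f}%, FPR = {fpr:.1f}%")
```

These are the rates at the classifier's default decision threshold; the ROC curve traces the same two quantities as the threshold on the predicted class probability is varied.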

Additionally, we conducted a permutation test, specifically Y-scrambling, to further evaluate the model's performance and confirm that the models were not randomly obtained. This involved scrambling the labels of PFRs within the training set 100 times, creating 100 pseudo training sets. For each pseudo training set, the model was built using 80% of the data with optimized parameters and its performance was evaluated based on the remaining 20%. Subsequently, the predictive performances of these 100 pseudo training sets were compared with those of the original training set (Table S4). The analysis revealed a significant difference between the original data and the Y-scrambled data, indicating that the original models outperform random chance.
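The Y-scrambling procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a one-feature least-squares line stands in for the tuned ML models, the data are synthetic, and the function names and random seed are hypothetical. The structure follows the description above: the target values are shuffled 100 times, and each pseudo dataset is fitted on 80% of the points and scored on the remaining 20%.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def holdout_r2(xs, ys, rng, train_frac=0.8):
    """Fit on a random 80% of the points and report R2 on the remaining 20%."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    cut = int(train_frac * len(xs))
    train, test = idx[:cut], idx[cut:]
    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
    preds = [slope * xs[i] + intercept for i in test]
    return r_squared([ys[i] for i in test], preds)


# Synthetic data with a genuine feature-target relationship.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

original_r2 = holdout_r2(xs, ys, rng)

# 100 pseudo datasets: same features, randomly permuted targets.
scrambled_r2 = []
for _ in range(100):
    shuffled = ys[:]
    rng.shuffle(shuffled)
    scrambled_r2.append(holdout_r2(xs, shuffled, rng))

mean_scrambled = sum(scrambled_r2) / len(scrambled_r2)
print(round(original_r2, 3), round(mean_scrambled, 3))
```

If a model trained on scrambled targets scored as well as the original, its apparent performance would be attributable to chance correlations; a large gap between the two indicates that the model has learned a real feature-target relationship.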

3.3 Model-based interpretation and feature exploration

Feature importance analysis was conducted utilizing a fine-tuned XGBoost model to evaluate the relative importance of biochar properties in predicting the contents and types of PFRs by regression and classification models (Fig. 3). The bee-swarm summary plot in Fig. 3a, provides an overview of the correlation and directionality between the biochar properties and their corresponding Shapley values during model construction for predicting PFRs content. The bar plot in Fig. 3b displays the contribution and average impact of each biochar property on the prediction of three types of PFRs using mean SHAP values. The important features for predicting the content of PFRs were ranked as follows: DI, PT, C%, SSA, FS, H%, O%, N%, and RT. Important features for the prediction of the PFR types were identified in the order of SSA, PT, C%, FS, O%, H%, N%, RT, and DI.

Fig. 3

ML-based feature importance analysis from the XGBoost model using SHAP values. a Bee-swarm summary plot for the contribution of each property of biochar for predicting PFRs content (spins g−1); red indicates high SHAP value and blue indicates low SHAP value. b Feature importance matrix plot for the contribution of each property of biochar for predicting PFRs types (g-factor). Blue, pink, and green bars indicate the mean SHAP value for g1, g2, and g3, respectively. Higher SHAP values of each feature indicate higher levels of PFRs in biochar

Based on the SHAP analysis, the top four influential features for PFR content (DI, PT, C%, and SSA) and for PFR types (SSA, PT, C%, and FS) were explored for their intrinsic correlation with PFRs. FS is considered to play a pivotal role in PFR formation, despite the diverse elemental composition of various feedstocks. A consistent PFR trend was observed when the PT ranged from 200 to 700°C (Deng et al. 2020). Although FS appears to exert a relatively lesser influence in our study, its role in predicting PFRs is detailed in Text S5. Furthermore, to analyze the impact of each data point and its effect on predicting PFRs, SHAP partial dependence plots (SPDP) and kernel density estimation (KDE) plots were employed, as depicted in Figs. 4 and 5, considering both categorical and numerical factors. Moreover, to corroborate the reliability of SPDP through ML-interpreted methods, box and count plots with binning features for PFRs content and types are presented in Figs. 4a-d and 5a-d, respectively. Remarkably, DI emerged as the most significant contributor to determining the content of PFRs within biochar, as evidenced in Fig. 3a. The SPDP boxplot analysis, as illustrated in Fig. 4a, revealed a positive correlation between PFRs content and DI. Regarding the variation in the PFRs content within biochar, we observed a range of 1–1000 × 10^15 spins g−1 for non-doped, 2000 × 10^15 spins g−1 for copper-doped, and 3500–4500 × 10^15 spins g−1 for zinc-, nickel-, and iron-doped biochar. Notably, the highest PFR content, reaching up to 6000 × 10^15 spins g−1, was observed in non-metal-doped biochar enriched with N and S in the box plot of Fig. 4a (Yu et al. 2020; Zhang et al. 2022; Zhang and Zhao 2022). These findings align with previous research, affirming that doping elements enhance PFRs content in biochar. For example, Yu et al. (2020) employed electron paramagnetic resonance (EPR) spectroscopy to demonstrate the capacity of N-doping in biochar.
In a study involving antibiotic degradation in various biochars with elemental doping, the highest PFRs content was found to be 9.23 × 10^19 spins g−1 in N-doped biochar, followed by 6.10 × 10^19 spins g−1 in S-doped and 4.36 × 10^19 spins g−1 in NS-doped biochar, compared to 2.45 × 10^18 spins g−1 in non-doped biochar (Zhang et al. 2022). This doping process redistributes electrons to neighboring carbon atoms through interconnected π-conjugated networks of polymeric carbon nitride, consequently increasing the PFRs content in biochar. Meanwhile, N or S doping also modifies the internal structures and defect values (ID/IG) of biochar, enhancing PFRs contents without affecting the g-factor of PFRs (Zhang and Zhao 2022).

Fig. 4

SHAP partial dependence plots (SPDP) of the XGBoost model. a-d SPDP of the four most important biochar properties influencing PFRs content (DI, PT, C%, and SSA) and their boxplots with binning features to elucidate the correlations between features and the content of PFRs

Fig. 5

Kernel density estimation plots (KDE) of the XGBoost model. a-d Three KDE plots and one box plot of the four most important biochar properties influencing PFRs types (SSA, PT, C%, and FS), and their count-plots with binning features to elucidate the distribution and frequency of the types of PFRs

Feature analysis identified PT as the second most important factor influencing both the content and types of PFRs (Fig. 3). The SPDP in Fig. 4b displays a unimodal distribution curve, indicating a positive association between PT and PFRs content within the temperature range of 200–500°C. This relationship significantly enhanced the model’s predictive capacity within this temperature range. However, the impact of PT on PFRs content shifted to negative as temperatures exceeded 500°C. These outcomes highlight that our prediction results are comparable with the reported results detected by EPR spectra, suggesting a favorable temperature range of 400–500°C for maximizing PFRs content (Bi et al. 2022). To elucidate the relationship between PT and PFRs types, a KDE plot was employed (Fig. 5b). This visual representation depicted a gradual increase in g1 and g2 with rising PT, while g3 exhibited a distinct decline. Further analysis delved into the specifics of the g1, g2, and g3 distributions. The count plot in Fig. 5b demonstrates that g1 predominantly emerged within the PT range of 400 to 700°C, accompanied by a gradual decrease in its concentration within this interval. In contrast, g3 was primarily observed below 500°C, while g2 was situated between 200 and 600°C. Notably, g1 was absent at 200°C and g3 was absent at 700°C. The findings suggested that as the temperature increases, PFRs gradually transform from g2 to g3 and then to g1, indicating that the elevation of pyrolysis temperature alters the type of PFRs present, consistent with prior research by Odinga et al. (2020). This phenomenon may be attributed to the influence of temperature on oxygen-containing functional groups present in biochar.
At 300 °C, PFRs primarily consist of oxygen-centered radicals (g3 > 2.0040) or carbon-centered radicals with oxygen atoms (g2: 2.0030–2.0040), originating from phenolic organic compounds in biochar that form stable phenoxy radicals through electron transfer to transition metals. As temperature surpasses 400 °C, carbon-centered radicals (g1 < 2.0030) gradually become the dominant PFRs, forming cyclopentadienyl groups at higher temperatures (Zhang et al. 2020). At 700 °C, PFRs rapidly break down, rendering their signal undetectable (Odinga et al. 2020).

The C% within biochar emerged as a pivotal predictor for PFRs, ranking third in feature importance for both PFRs contents and types (Fig. 3). Previous studies have underscored the significance of C% in influencing g1-type PFR and overall PFRs content (Zhang et al. 2022). Our findings reinforce this relationship, revealing a direct correlation between PFR content and higher C% in biochar, as depicted in Fig. 4c. This connection is strengthened by the elevated SHAP values associated with C%, enhancing the predictive capability of ML model for PFR content. Notably, our analysis indicated that biochar samples with C% falling within the 40% to 80% range exhibit significantly higher PFRs levels. These results could explain the previous findings observed by Zhang et al. (2023b), where the addition of polystyrene (PS) during antibiotic degradation resulted in an increased C% and PFRs content within biochar. Additionally, Ni et al. (2020) found that elemental analysis of biochar following PS-addition pyrolysis revealed a remarkable 90% increase in C%, subsequently elevating PFR levels and the prevalence of g1-type (< 2.0030) PFRs within the biochar matrix. Another study corroborated the significance of C% for PFRs, indicating that higher urea proportions led to reduced C% and PFR contents but elevated N% in biochar (Bi et al. 2022). Furthermore, our analysis in SCC also indicated a negative correlation between C% and N% in biochar (Fig. 1i). These insights highlight the significant role of C% in PFR formation, with N showing an inverse correlation with PFRs, likely due to the presence of distinct N-types like pyridine N, pyrrole N, and quaternary N in N-rich biochar, absent in pure cellulose biochar (Tian et al. 2013, 2014). Additionally, our study reveals the complex interplay of N% in biochar, influenced by DI and urea addition. 
Non-metal-doped biochar with high N% exhibits elevated PFR content through electron redistribution (Zhang and Zhao 2022), whereas urea introduction increases N% but decreases PFR content (Bi et al. 2022), indicating a multifaceted relationship among DI, urea, and PFRs that may be influenced by as yet unidentified factors. Further research is needed to elucidate the specific nitrogen-containing compounds generated by urea and their impact on PFR properties. The relationship between C% and the g-factor of PFRs is depicted in the KDE plot in Fig. 5c. An increase in C% corresponded to a decreasing KDE trend for g3 and a simultaneous increase for g2 and g1. This observation is further supported by the count plot, which shows that g1 is less likely to be present in biochar with low C% than in biochar with higher C%. A C% range of 40–80% is most conducive to the presence of g1-type PFRs, while g3 is more likely to be associated with biochar containing less than 60% carbon, and g2 tends to fall within the mid-range of C%. These findings underscore the crucial role of C% in biochar, which promotes the production of PFRs, particularly carbon-centered radicals (g1 < 2.0030) (Ni et al. 2020; Zhang et al. 2023b).
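For readers unfamiliar with KDE plots such as the one discussed above, the following minimal sketch shows how a Gaussian kernel density estimate is built for one variable, grouped by PFR type. It is not the plotting code used in this study. The function `gaussian_kde`, the bandwidth, and the C% values are all assumptions made up for illustration and are not taken from the dataset.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels on `samples`."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, then normalize.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Placeholder C% values invented for illustration; NOT data from the study.
placeholder_c_percent = {
    "g1": [62.0, 68.0, 71.0, 75.0, 79.0],
    "g3": [35.0, 42.0, 47.0, 51.0, 58.0],
}

BANDWIDTH = 5.0  # hypothetical smoothing width, in C% units

kde_by_type = {
    label: gaussian_kde(values, BANDWIDTH)
    for label, values in placeholder_c_percent.items()
}

# Evaluate each density on a coarse C% grid.
for c in range(30, 91, 10):
    row = ", ".join(f"{label}={kde(c):.4f}" for label, kde in kde_by_type.items())
    print(f"C% = {c}: {row}")
```

Comparing the curves across the grid shows where each type is relatively more concentrated, which is the kind of reading applied to Fig. 5c.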

The SSA of biochar was also an important feature, ranking first and fourth in predicting the g-factor and the content (spins g−1) of PFRs in biochar, respectively (Fig. 3). SSA increases with PT, leading to a higher content of carbon-centered PFRs than in biochar with lower SSA (Zhang et al. 2022). The KDE plot (Fig. 5a) indicated a positive contribution to g1-type PFRs with increasing SSA, along with a decrease in g2- and g3-type PFRs. The SHAP values within the KDE plot further reveal that an increase in SSA favors the prediction of g1 while slightly disfavoring g2 and g3, particularly up to an SSA of 150 m2 g−1. The count plots in Fig. 5a support these findings: g1 tends to be more prevalent within the SSA range of 100–200 m2 g−1, whereas g3 was predominantly present at SSA below 50 m2 g−1. Interestingly, g2 appeared primarily within the SSA range of 50–150 m2 g−1, reinforcing the intricate interplay between SSA and the composition of PFRs in biochar. The relationship between SSA and PFR content is shown in Fig. 4d, which depicts higher PFR content at SSA of 100–150 m2 g−1. SHAP values indicated a positive influence on PFR content prediction up to an SSA of 100 m2 g−1, beyond which a diminishing trend became apparent, potentially affecting the model's predictive performance. While these findings may deviate from some prior studies, the limited data available at higher SSA levels could contribute to this disparity. Further analysis revealed elevated N% in biochar with higher SSA, suggesting a complex interaction. The SHAP interaction plot (Fig. S6) was used to examine the interplay between N%, SSA, and PFR content and its influence on the model's predictions.
Notably, an increase in N% within nitrogen-rich biochar affected PFR content without altering the g-factor (Bi et al. 2022). Our findings align with previous studies suggesting that PFR content is mainly related to the SSA of biochar and the degree of defect structure (ID/IG values) of the raw material (Zhang and Zhao 2022). This may be attributed to the availability of more active sites in biochar, which contributes positively to both the overall PFR content and the formation of carbon-centered PFRs (Zhang and Zhao 2022). These results enhance our understanding of the complex relationship among SSA, N%, and PFR formation in biochar.
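The SHAP values referred to above are based on Shapley values from cooperative game theory. As a hedged illustration of the idea, the sketch below computes exact Shapley values for a toy three-feature function against a single baseline. It does not use the SHAP library, and the function `toy_model`, its coefficients, the feature names, and the input values are all hypothetical, chosen only so that an SSA × N interaction term is present.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to one `baseline` point."""
    features = list(x)
    n = len(features)

    def value(subset):
        # Features in `subset` take the instance value; others stay at baseline.
        point = {f: (x[f] if f in subset else baseline[f]) for f in features}
        return model(point)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_model(p):
    # Arbitrary coefficients for illustration; NOT the model from the study.
    return 0.02 * p["C"] + 0.01 * p["SSA"] - 0.05 * p["N"] + 0.001 * p["SSA"] * p["N"]


x = {"C": 70.0, "SSA": 120.0, "N": 2.0}         # hypothetical sample
baseline = {"C": 50.0, "SSA": 40.0, "N": 1.0}   # hypothetical reference point

phi = shapley_values(toy_model, x, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.4f}")
```

The contributions sum to the difference between the prediction at `x` and at `baseline`, and the interaction term is shared between SSA and N. Practical SHAP implementations average over a background dataset and use model-specific approximations rather than this brute-force enumeration.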

3.4 Development of graphical user interface (GUI) tool

To make our prediction models accessible to scientists and practitioners, a web-based GUI tool was developed (Fig. S8). The tool accepts biochar properties as the optimal input variables and provides two outputs: a regression model predicts the content of PFRs (spins g−1), and a classification model predicts the type of PFRs (g-factor class: g1, g2, or g3) for practical real-world applications. The tool streamlines the prediction process and gives researchers in the biochar field a resource for comprehensive PFR analysis. Consequently, it can save both time and resources in research and engineering work focused on PFRs in biochar materials.
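The two-output design described above can be sketched as a thin wrapper around one regression model and one classification model. This is not the code behind the published tool. The input names in `REQUIRED_INPUTS`, the `PFRPrediction` type, the `predict_pfr` function, and the stub models are all hypothetical stand-ins, and the stubs return fixed dummy values rather than real predictions.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Hypothetical input names; the tool's actual input fields are not listed here.
REQUIRED_INPUTS = ("PT", "C_percent", "N_percent", "SSA")
VALID_TYPES = ("g1", "g2", "g3")


@dataclass
class PFRPrediction:
    content_spins_per_g: float  # regression output
    g_type: str                 # classification output


def predict_pfr(
    properties: Mapping[str, float],
    regressor: Callable[[Mapping[str, float]], float],
    classifier: Callable[[Mapping[str, float]], str],
) -> PFRPrediction:
    """Validate the inputs, then query both models for one biochar sample."""
    missing = [name for name in REQUIRED_INPUTS if name not in properties]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    g_type = classifier(properties)
    if g_type not in VALID_TYPES:
        raise ValueError(f"unexpected PFR type: {g_type!r}")
    return PFRPrediction(float(regressor(properties)), g_type)


# Stand-in models returning dummy values; NOT trained models.
def stub_regressor(properties):
    return 0.0


def stub_classifier(properties):
    return "g1"


sample = {"PT": 500.0, "C_percent": 70.0, "N_percent": 1.5, "SSA": 120.0}
print(predict_pfr(sample, stub_regressor, stub_classifier))
```

Keeping the two models behind one function means a front end only needs to collect the inputs once to display both the predicted content and the predicted type.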

4 Conclusion

The double-edged nature of PFRs presents an intriguing environmental challenge. The aim is to harness their beneficial effects, primarily in pollutant cleanup, by increasing their content, while minimizing their detrimental impact when biochar is used as a soil amendment for crop growth, where lower PFR levels are preferred. However, achieving the desired PFR level in biochar depends on several factors, which presents a substantial challenge (Wang et al. 2022; Zhang et al. 2022; Prasertpong et al. 2023). Moreover, the exclusive reliance on EPR for PFR detection poses limitations: EPR is not widely accessible, is often expensive, requires expertise for spectral interpretation, and cannot be applied before biochar preparation, which hinders the optimization of preparation parameters for PFRs. This underscores the importance of predictive modeling for PFRs in biochar. To address this knowledge gap, we conducted an ML-based study using data sourced from reputable journals. This approach yielded robust predictive performance. Consequently, our study illuminates the role of PFRs in biochar research and helps to fill a critical knowledge gap in the field. To enhance the accessibility of PFR prediction for general applications, we have integrated the prediction models into a web-based GUI tool, which uses our ML models to provide efficient predictions. Furthermore, we plan to incorporate additional factors, such as pH and heavy metal content, to enhance the model's predictive capability and to better understand the properties of PFRs in various environmental matrices. We aim to advance our understanding of the environmental implications of biochar and its PFRs, ultimately contributing to more effective and sustainable environmental applications of biochar.