1 Introduction

Tetracycline (TC) is a persistent organic pollutant found in surface water, groundwater, and drinking water. It can cause endocrine disruption and spread antibiotic-resistance genes, posing serious risks to human health and the environment (Bilal et al. 2020; Zhang et al. 2020). Incomplete metabolism and the resulting TC emissions have attracted considerable research attention recently (Zeng et al. 2021), as TC is frequently used as an antimicrobial agent and feed additive in agriculture and livestock production (Gopal et al. 2020). Chemical oxidation, biological treatment, and physical removal are the main treatments for TC wastewater (Phoon et al. 2020). Biological methods are difficult to apply to TC removal because of its antimicrobial properties (Zhu et al. 2021a). Because of its intrinsic benefits, such as simplicity, low cost, and high efficiency, adsorption is regarded as an excellent technology for treating TC (Cheng et al. 2021). Owing to its distinctive qualities, including a large specific surface area, homogeneous pore distribution, and a high abundance of surface functional groups, biochar (BC) has been extensively studied as an adsorbent for removing contaminants from wastewater (Akhil et al. 2021).

Van der Waals forces and hydrogen bonds, along with covalent and ionic bonding, are primarily responsible for the adsorption of pollutants onto BC (Thangaraj & Solomon 2019). As a result, the BC’s characteristics, the adsorption environment, and the ratio of adsorbate to adsorbent all play major roles in the adsorption process. Previous research has extensively assessed the traditional kinetic and isothermal adsorption models (Chen et al. 2018; Jang and Kan 2019; Liu et al. 2021). Results indicated that electrostatic interactions and chemisorption are among the potential adsorption mechanisms. Within the same framework, the relationship between each influencing factor and the amount of sorption can be determined using a conventional controlled-variable experimental approach. However, traditional batch sorption studies are time-consuming and inefficient for choosing the optimal BC (Li et al. 2022). Predicting adsorption efficiency, improving process parameters, and understanding the adsorption mechanism require practical tools, which has motivated the exploration of machine learning algorithms (Luo et al. 2023; Cao et al. 2023).

Machine learning (ML), a subset of artificial intelligence, relies on automated, data-driven model construction. The creation of ML models involves the use of various training techniques. Recent works applied ML algorithms to carbon-based materials for TC adsorption (Taoufik et al. 2022; Zhu et al. 2021b). However, the predictive performance of these models needs improvement. First, the study by Zhu et al. (2021b) covered carbon-based materials of different compositions, including activated carbon and BC, so a prediction model spanning both would show large variance; second, their study had a small database, and the best correlation coefficient (R²) was only 0.8944, necessitating optimization of the ML model (Leng et al. 2022; Yang et al. 2022). Integrated learning models should therefore be applied to TC adsorption on BC alone to test the prediction effect.

For pattern identification based on ambiguous information, many ML techniques have been developed, including rough set theory (RST) (Pawlak 1982), fuzzy set theory (Goguen 1974), and evidence theory (Dempster 1967). The approximation-based methodology of these theories enables them to identify structural links in noisy and erratic data. Pawlak (1982) originally proposed RST to categorize objects based on the supplied data; it is applied in classification, prediction, and decision analysis tasks, and rough set-based machine learning (RSML) was developed from it. The input data for RSML’s information tables are the object characteristics, which are divided into condition attributes and decision attributes (Pawlak 1997). The factors that place an object into a certain decision class are known as condition attributes. A subset known as a reduct can be created by removing redundant condition attributes. The rough set algorithm then provides a list of classification rules using the reduct as its base. Multiple reducts are frequently generated in a case study, and the condition attributes that appear in all the reducts are regarded as the core attributes. If the core attributes of the information system were deleted, the classification rules would be rendered useless (Pawlak 1997). The use of RSML has many benefits, including the ability to produce rules for various decision classes and to validate the rules through their scientific coherence.

Traditional ML models such as artificial neural networks (ANN), support vector machines (SVM) and linear regression have been successful in many applications. However, they are criticized for being opaque (Rudin and Radin 2019), hard to interpret (Rudin 2019) and dependent on interpretation tools such as SHAP (Shapley additive explanations) (Merrick and Taly 2020). As a result of these restrictions, explainable artificial intelligence (XAI) has grown in popularity for applications that require scientifically meaningful outcomes in addition to the statistical performance of the developed models (Calegari et al. 2020). RSML is an XAI method that creates if–then rules for categorizing data based on identified influential parameters. It has the advantage of locating hidden patterns in data sets and is based on Pawlak’s (1982) RST, which categorizes items based on existing knowledge. The if–then rules produced by RSML characterize and generalize the data and can be used to predict future outcomes. Although recent RSML research has been undertaken in a variety of domains, only a few papers to date have reported on RSML for biomass pyrolysis, covering BC energy potential (Tang et al. 2023), BC surface properties (Ang et al. 2023), and bio-oil properties (Chong et al. 2022). Other studies using RSML include forecasting CO2 storage integrity (Aviso et al. 2019), building energy usage (Lei et al. 2021), detection of city energy usage and greenhouse gas (GHG) emissions (Aviso et al. 2021), estimating water quality (Albuquerque et al. 2021), forecasting impeller service life (Zhao et al. 2019), and evaluating variable effectiveness and success criteria for construction projects (Akbari et al. 2018).

BC feedstocks include numerous forms of biomass, such as crop wastes, agricultural residues, and algae. Biomass qualities (i.e., volatile matter, ash content, and carbon content) and pyrolysis conditions (i.e., temperature, heating rate and retention time) affect BC’s adsorption efficiency. However, reports on the BC characteristics desirable for TC removal are scattered, with no clear conclusion. It is therefore essential to study the conditions that produce BC suitable for TC adsorption. RSML can categorize characteristics to uncover hidden relationships between condition and decision attributes. This study examines the impact of biomass qualities and operating conditions on BC surface properties for TC removal. To the best of the authors’ knowledge, no prior literature applies RSML algorithms to build performance prediction models for the equilibrium sorption capacity of TC based on the combination of fifteen factors, although a few studies have predicted TC adsorption on BC through traditional ML methods (Zhang et al. 2023; Zhou et al. 2023). Thus, this study explores the applicability of rough sets in developing a prediction model that assists in choosing the optimal conditions for achieving the maximum equilibrium sorption capacity.

The objective of this study was to create a general RSML model to forecast the sorption capacity of TC on BC based on the adsorbent characteristics and sorption circumstances. This study utilized four key steps in RST to develop a prediction model that included discretization, identification of core attributes and generation of reducts, generation of decision rules, and evaluation. Thus, the novelty of the current study is that it explored the comparative performance evaluation of RSML with incomplete and complete datasets. Since RSML produces a set of decision rules that express conditional and decision qualities, further validating and analyzing these rules results in the selection of decisive, interpretable rules. These guidelines should determine the biomass parameters and operating conditions that produce BC with the appropriate characteristics for maximizing TC adsorption.

2 Materials and methods

2.1 Collection, pre-processing of input BC data

Twenty-two types of BC and 295 sets of experimental data for TC adsorption by BC were gathered (Tables S1 and S2) from pertinent, recently published literature (Chen et al. 2018, 2021; Choi et al. 2020; Fan et al. 2020; Jang and Kan 2019; Kim et al. 2020; Shen et al. 2020; Wang et al. 2018; Xu et al. 2020; Zhang et al. 2019; Zheng et al. 2021). Data were extracted from the published studies using Plot Digitizer v3 (https://plotdigitizer.com/) without author bias (Wilschut et al. 2022).

Fifteen key parameters were compiled into the dataset. Twelve were taken directly from the published literature: pyrolysis temperature (Tpy, °C), pH of the BC in water (pH Char), total carbon in the BC (C, wt%), molar ratio of oxygen and nitrogen to carbon [(O + N)/C], molar ratio of oxygen to carbon (O/C), molar ratio of hydrogen to carbon (H/C), ash content (Ash, wt%), surface area (BET, m²/g), pore volume (PV, cm³/g), BC pore size (PS, nm), adsorption temperature (Ta, °C) and adsorption solution pH (pH Solution). The remaining three were the initial concentration of TC (CTo, mg/L), the BC dosage (Cchar, g/L) and the ratio of TC to BC (Co, mmol/g), the last of which was computed using Eq. (1) as outlined in Zhu et al. (2019),

$${C}_{o}=\frac{{C}_{To}}{\left({C}_{char}\times 444.4\right)}$$
(1)

where \({C}_{To}\) and \({C}_{char}\) represent the initial concentrations of TC (mg/L) and BC (g/L), respectively, considered in the corresponding study.
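As an illustration of Eq. (1), the following minimal sketch converts a TC concentration and a BC dosage into the molar ratio Co. The function name and the example batch values are hypothetical; only the formula and the constant 444.4 come from the text above.

```python
TC_MOLAR_FACTOR = 444.4  # constant from Eq. (1); converts mg of TC to mmol


def tc_to_bc_ratio(c_to_mg_per_l: float, c_char_g_per_l: float) -> float:
    """Return C_o (mmol TC per g BC) from C_To (mg/L) and C_char (g/L), per Eq. (1)."""
    if c_char_g_per_l <= 0:
        raise ValueError("BC dosage must be positive")
    return c_to_mg_per_l / (c_char_g_per_l * TC_MOLAR_FACTOR)


# Hypothetical batch experiment: 100 mg/L TC with 0.5 g/L BC
example_ratio = tc_to_bc_ratio(100.0, 0.5)
print(f"C_o = {example_ratio:.3f} mmol/g")
```

The units work out because mg/L divided by g/L gives mg of TC per g of BC, and dividing by 444.4 mg/mmol yields mmol/g.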

2.2 Translation and classification of BC’s conditional attributes

Fifteen crucial criteria were considered and grouped into four categories to frame the general rule for defining BC characteristics aimed at TC adsorption. The adsorption efficiency was expressed by the equilibrium sorption capacity Qe (mg/g), and the four categories of input data were pyrolysis conditions, BC characteristics, adsorption conditions and the initial concentration ratio of TC to BC. Firstly, the pyrolysis temperature (Tpy, °C) was taken into consideration. Secondly, BC properties such as pH of the BC in water (pH Char), total carbon in the BC (C, wt%), molar ratio of oxygen and nitrogen to carbon [(O + N)/C], molar ratio of oxygen to carbon (O/C), molar ratio of hydrogen to carbon (H/C), ash content (Ash, wt%), surface area (BET, m²/g), pore volume (PV, cm³/g), and BC pore size (PS, nm) were considered. Thirdly, the adsorption conditions, namely adsorption temperature (Ta, °C) and aqueous solution pH (pH Solution), were taken into consideration. Finally, the initial concentration of TC (CTo, mg/L), the initial concentration of BC (Cchar, g/L), and the ratio of TC to BC (Co, mmol/g) were also considered. These conditional attributes were identified and translated into measurable properties.

2.3 Development of RSML model

The sample dataset and the information table for the present study of TC adsorption on biochar using case 1 (Ideal dataset) are shown in Table S3a and b. Likewise, those for case 2 (Practical dataset) are shown in Table S4a and b. The RSML model was developed using a tabular decision table representation of the database. Each row in the table represents an object, and the columns represent attributes corresponding to each object, which can be categorized as condition attributes or decision attributes. The information table is represented as S = (U, A), where U is a non-empty finite set of objects (the universe) and A is a non-empty finite set of attributes. Large datasets may contain indiscernible objects, which are objects that share the same values for the attributes considered. Figure 1 depicts the framework of the RSML algorithm and how it works for the current study. The condition attributes were reduced by removing those that did not affect the final decision, which lowers the redundancy of the data while retaining its basic features. A reduct is a subset of indispensable attributes that can partition the database with the same level of discrimination as the original set of attributes, while the core is the intersection of all reducts and represents the essential attribute set that cannot be excluded from the decision system without losing the equivalence class structure. The RSML algorithm induces decision rules from the training data in the information table; each rule states the output class that applies when a given set of conditional attributes is satisfied. The reduction is performed based on the indiscernibility of objects, that is, the relation between two or more objects that cannot be distinguished from one another by the condition attributes considered. The main purpose of this step is to discard the attributes that have no effect on the decision and to keep the influential conditional attributes. The final information table is then used for approximation, reduction, and rule generation using the open-source ROSE2 software, as outlined in Tang et al. (2023). More information on RST and its software implementation can be found in Prędki et al. (1998) and Prędki and Wilk (1999).
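The notions of indiscernibility, reduct and core described above can be sketched on a toy decision table. This is a brute-force illustration only, not the procedure implemented in ROSE2; the attribute names, the discretized values and the helper functions are all hypothetical.

```python
from itertools import combinations


def positive_region(rows, attrs, decision):
    """Indices of objects whose indiscernibility class under `attrs` has a single decision."""
    classes = {}
    for i, row in enumerate(rows):
        classes.setdefault(tuple(row[a] for a in attrs), []).append(i)
    pos = set()
    for members in classes.values():
        if len({rows[i][decision] for i in members}) == 1:
            pos.update(members)
    return pos


def find_reducts(rows, conditions, decision):
    """Minimal attribute subsets that keep the positive region of the full attribute set."""
    full = positive_region(rows, conditions, decision)
    reducts = []
    for size in range(1, len(conditions) + 1):
        for subset in combinations(conditions, size):
            if any(set(r) <= set(subset) for r in reducts):
                continue  # contains a smaller reduct, so it is not minimal
            if positive_region(rows, subset, decision) == full:
                reducts.append(subset)
    return reducts


def core(reducts):
    """Attributes shared by every reduct."""
    return set.intersection(*(set(r) for r in reducts)) if reducts else set()


# Hypothetical discretized table: "dose" mirrors "Co", so either one suffices.
table = [
    {"pH": "low", "Co": "high", "dose": "low", "Ta": 25, "Qe": "high"},
    {"pH": "low", "Co": "low", "dose": "high", "Ta": 25, "Qe": "low"},
    {"pH": "high", "Co": "high", "dose": "low", "Ta": 25, "Qe": "low"},
    {"pH": "high", "Co": "low", "dose": "high", "Ta": 35, "Qe": "low"},
    {"pH": "low", "Co": "high", "dose": "low", "Ta": 35, "Qe": "high"},
]
reducts = find_reducts(table, ["pH", "Co", "dose", "Ta"], "Qe")
print(reducts, core(reducts))
```

In this toy table two reducts exist and only the attribute they share forms the core, which mirrors how core attributes are read off from multiple reducts. The exhaustive search grows exponentially with the number of attributes, so it is suitable only for small illustrations.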

Fig. 1 Overview of the RSML algorithm used in TC adsorption on BC

2.4 Performance evaluation of the developed RSML model

From the standpoint of Bayesian probability, the effectiveness of the created rules can be evaluated quantitatively based on the coverage, certainty, and strength of the rules (Pawlak 2002). As the rules produced by RSML may not always be deterministic, model inconsistencies must be evaluated through these three key criteria to understand their impact on model performance. A rule’s generalization capacity is measured by its strength and coverage, while its predictive accuracy is assessed by its certainty. Higher strength, certainty, and coverage values indicate a well-trained RSML model. A decision rule’s strength (Strx) is defined as the fraction of objects in the dataset (card [U]) that support the rule by satisfying both its conditions and its decision (suppx [C, D]) (Eq. [2]). A rule’s certainty (Cerx) quantifies the likelihood of an object being assigned to a decision class (Dx) if it exhibits a particular set of conditional characteristics (Cx) (Eq. [3]). When selecting decision rules, rules with higher certainty values are preferred. The coverage (Covx) of a rule is the fraction of objects in a decision class that it successfully categorizes (Eq. [4]).

$${\text{Strength}}, {Str}_{x}=\frac{\mathrm{supp_{x}}({\text{C}},\mathrm{ D})}{\mathrm{card }({\text{U}})}$$
(2)
$${\text{Certainty}}, {Cer}_{x}=\frac{\mathrm{supp_{x} }({\text{C}}, \mathrm{ D})}{\mathrm{card }({\text{C}\{\text{x}\}})}$$
(3)
$${\text{Coverage}}, {Cov}_{x}=\frac{\mathrm{supp_x }({\text{C}},\mathrm{ D})}{\mathrm{card }({\text{D}\{\text{x}\}})}$$
(4)
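Equations (2) to (4) can be evaluated for a single rule as in the minimal sketch below. The toy objects, the pH-based rule and the attribute names are hypothetical and serve only to show how the three ratios share the same numerator.

```python
def rule_metrics(rows, condition, decision_class, decision_key="Qe_class"):
    """Return (strength, certainty, coverage) of the rule 'if condition then decision_class'."""
    matching = [r for r in rows if condition(r)]                    # objects satisfying C{x}
    in_class = [r for r in rows if r[decision_key] == decision_class]  # objects in D{x}
    support = sum(1 for r in matching if r[decision_key] == decision_class)
    strength = support / len(rows)                                  # Eq. (2)
    certainty = support / len(matching) if matching else 0.0        # Eq. (3)
    coverage = support / len(in_class) if in_class else 0.0         # Eq. (4)
    return strength, certainty, coverage


# Hypothetical objects: solution pH and the Qe class they fall into
objects = [
    {"pH": 5, "Qe_class": 1},
    {"pH": 6, "Qe_class": 1},
    {"pH": 4, "Qe_class": 2},
    {"pH": 9, "Qe_class": 1},
    {"pH": 10, "Qe_class": 2},
]
# Hypothetical rule: if 3 <= pH <= 7 then Qe_class = 1
strength, certainty, coverage = rule_metrics(objects, lambda r: 3 <= r["pH"] <= 7, 1)
print(strength, certainty, coverage)
```

Here two of the five objects support the rule, three objects satisfy its condition, and three objects belong to the decision class, so the rule is neither fully certain nor fully covering.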

2.5 Validation of the developed RSML model

The RSML model was validated using k-fold cross-validation of the dataset to assess the performance and forecast accuracy of the reduct sets and the resulting decision rules. There were no duplicates between the validation set and the training set. The validation step is intended to evaluate the model’s performance on new data that were never used to train the model. Good performance on the validation set implies that the model has correctly learned the applicable general principles. As reflected by the number of examples covered in the training dataset, underlying patterns in the data should be turned into rules with an appropriate balance of prediction accuracy and generalization strength. However, if the created rules exhibit low accuracy or coverage, this may be due to the presence of clusters with distinct behavior in the dataset. Therefore, to improve the accuracy and coverage of the predictions, the RSML model is revised by consolidating the best rules or minimizing the conflicting rules. These revisions stop once rules with the desired certainty and coverage are attained.

The k-fold cross-validation generated an N × (N + 1) confusion matrix, where N is the number of desired classes in the output attribute. The diagonal entries of the confusion matrix are the true positive values, which the model predicted accurately, whereas the off-diagonal entries are the mispredicted values; lower off-diagonal values therefore indicate higher model accuracy. Specific validation metrics, namely accuracy, precision, recall and F1 score, were calculated from the confusion matrix using Eqs. (5)–(8), in which TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. Accuracy and precision were used to evaluate the model performance based on prediction effectiveness. Recall quantifies how well a classification model identifies all pertinent instances within a given dataset. The F1 score, the harmonic mean of recall and precision, was employed to assess the overall performance of the classification model. All four metrics range from 0 to 100%.

$$Accuracy= \frac{TP+TN}{TP+TN+FP+FN}$$
(5)
$$Precision= \frac{TP}{TP+FP}$$
(6)
$$Recall= \frac{TP}{TP+FN}$$
(7)
$$F1\ score= 2\times \frac{Precision\times Recall}{Precision+Recall}$$
(8)
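For a multiclass problem, Eqs. (5) to (8) are typically applied to each class in a one-vs-rest fashion. The sketch below assumes a square N × N matrix with actual classes in rows and predicted classes in columns; the matrix values are hypothetical, and returning zero when a denominator vanishes is a convention chosen here rather than something stated in the text.

```python
def per_class_metrics(matrix, k):
    """Accuracy, precision, recall and F1 for class k (rows: actual, columns: predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                          # class k predicted as another class
    fp = sum(matrix[i][k] for i in range(n)) - tp     # other classes predicted as class k
    tn = total - tp - fn - fp
    accuracy = (tp + tn) / total                      # Eq. (5)
    precision = tp / (tp + fp) if tp + fp else 0.0    # Eq. (6)
    recall = tp / (tp + fn) if tp + fn else 0.0       # Eq. (7)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)             # Eq. (8)
    return accuracy, precision, recall, f1


# Hypothetical 3-class confusion matrix
confusion = [
    [8, 1, 1],
    [2, 5, 3],
    [0, 1, 9],
]
for k in range(len(confusion)):
    print(k, per_class_metrics(confusion, k))
```

A class that is never predicted correctly gets precision, recall and F1 of zero under this convention, even though its one-vs-rest accuracy can remain high because of the true negatives.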

2.6 Comparative evaluation of the developed RSML model with the other classifiers

The developed RSML model was compared with existing classifier models through PyCaret (Ali 2020), an open-source, low-code machine learning library in Python that wraps several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more. The data were passed through multiclass classifiers and validated using k-fold cross-validation of the dataset to assess their performance and forecast accuracy.

3 Results and discussion

This study was carried out with 295 datasets collected from the literature. The 94 datasets without any missing values were segregated into a separate database, designated as the Ideal dataset, and analyzed as case 1. In case 2, all 295 collected datasets, of which 201 had missing values, were designated as the Practical dataset and considered for RSML analysis. Since the RSML algorithm is known to handle incomplete datasets, this study compared the impact of using the Ideal and Practical datasets in machine learning.
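The separation into the two cases can be sketched as follows, assuming that each record is a dictionary and that a missing value is stored as `None`; both the representation and the example records are assumptions made for illustration.

```python
def split_cases(records):
    """Return (ideal, practical): complete records only, and all records."""
    ideal = [r for r in records if all(v is not None for v in r.values())]
    practical = list(records)
    return ideal, practical


def count_incomplete(records):
    """Number of records with at least one missing value."""
    return sum(1 for r in records if any(v is None for v in r.values()))


# Hypothetical records; None marks a value not reported in the source study
records = [
    {"Tpy": 500, "BET": 120.0, "pH_solution": 6, "Qe": 85.0},
    {"Tpy": 700, "BET": None, "pH_solution": 5, "Qe": 210.0},
    {"Tpy": 300, "BET": 15.0, "pH_solution": 7, "Qe": 30.0},
]
ideal, practical = split_cases(records)
print(len(ideal), len(practical), count_incomplete(records))
```

Applied to the study's data, the same logic would be expected to yield 94 complete records for case 1 and all 295 records, 201 of them incomplete, for case 2.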

3.1 Exploratory data analysis

In this study, fifteen input parameters under four broad categories were considered as the condition attributes: pyrolysis conditions (Tpy), feedstock characteristics (ratios from the ultimate analysis such as [(O + N)/C], (O/C), (H/C), and ash), BC characteristics (such as pH Char, pore size (PS), BET surface area, and total pore volume (PV)) and adsorption experimental conditions (such as adsorption temperature [Ta], pH of the aqueous solution [pH Solution], initial TC concentration [CTo], BC dose [Cchar], and initial concentration ratio of TC to BC [Co]). Meanwhile, the adsorption capacity of TC on BC was selected as the decision attribute of the RSML study. The rationale behind selecting these fifteen parameters was as follows. The pyrolytic temperature (Tpy) plays a major role in dictating the BC characteristics and was therefore included. When the pyrolytic temperature is low, the resulting BC can show weak acidity due to the incomplete release of alkali salts. However, the average pH of the BC (pH Char) samples considered in the current study was around 9.2 and 9.1 in case 1 and case 2, respectively, and the BC was loaded into the aqueous solution containing TC for its removal by adsorption. Nevertheless, both the pH Char and the pH of the aqueous solution (pH Solution) were considered, as the pH of the adsorption solutions was adjusted with either acid or base in all batch adsorption studies. Apart from the pH Solution, the initial TC concentration (CTo in mg/L), initial BC concentration (Cchar in g/L) and the ratio of TC to BC (Co in mmol/g) were also considered in the present study.

The exploratory data analysis (EDA) was performed using a univariate graphical method to illustrate the most fundamental statistical features of the data distribution. Violin plots were employed for data visualization as they display data distributions in a clear and informative way (Figs. 2 and 3 for cases 1 and 2, respectively). The white dot represents the median of the data, the bar represents the interquartile range in which the majority of the data lie, and the quartile markers (25%, 50%, 75%, and 100%) provide insights into the data distribution along with its range.
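The summary statistics that a violin plot marks (median, quartiles and range) can be computed with the Python standard library, as in this minimal sketch. The pyrolysis temperatures are hypothetical, and the "inclusive" quantile method is a choice made here, not one stated in the text.

```python
import statistics


def violin_summary(values):
    """Quartiles, interquartile range and extremes for one attribute."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }


# Hypothetical pyrolysis temperatures (°C)
summary = violin_summary([300, 400, 500, 600, 700])
print(summary)
```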

Fig. 2 Violin plots depicting the distribution and density of the Ideal dataset (case 1): (a) pyrolysis temperature (T, °C), (b) total carbon in the BC (C, wt%), (c) molar ratio of oxygen and nitrogen to carbon [(O + N)/C], (d) molar ratio of oxygen to carbon (O/C), (e) molar ratio of hydrogen to carbon (H/C), (f) ash content (Ash, wt%), (g) pH of the BC in water (pH_H2O), (h) BC pore size (PS, nm), (i) surface area (BET, m²/g), (j) pore volume (PV, cm³/g), (k) adsorption temperature (T, °C), (l) solution pH (pH_sol), (m) initial concentration of TC (mg/L), (n) initial concentration of BC (g/L), (o) the initial concentration ratio of TC to BC (C0, mmol/g) and (p) equilibrium sorption capacity Qe (mg/g). (The y-axis ranges from -1 to +1)

Fig. 3 Violin plots depicting the distribution and density of the Practical dataset (case 2): (a) pyrolysis temperature (T, °C), (b) total carbon in the BC (C, wt%), (c) molar ratio of oxygen and nitrogen to carbon [(O + N)/C], (d) molar ratio of oxygen to carbon (O/C), (e) molar ratio of hydrogen to carbon (H/C), (f) ash content (Ash, wt%), (g) pH of the BC in water (pH_H2O), (h) BC pore size (PS, nm), (i) surface area (BET, m²/g), (j) pore volume (PV, cm³/g), (k) adsorption temperature (T, °C), (l) solution pH (pH_sol), (m) initial concentration of TC (mg/L), (n) initial concentration of BC (g/L), (o) the initial concentration ratio of TC to BC (C0, mmol/g) and (p) equilibrium sorption capacity Qe (mg/g). (The y-axis ranges from -1 to +1)

When the key factors are optimized to achieve the maximum Qe, the factors that influence Qe tend to vary in synchrony with each other. Therefore, it is necessary to examine the relationships among the various feedstock and pyrolysis variables, as well as the ratio of TC to BC used. Figure 4 depicts the Pearson correlation matrix relating the fifteen input parameters to the desired output, the equilibrium sorption capacity (Qe) of TC on BC. The sign of the Pearson correlation coefficient determines the type of correlation between parameters, and its magnitude indicates the strength of the linear association between them. As observed from Fig. 4a, C, BET, PV, CTo, and Co had significant positive correlations with Qe in case 1. However, for the Practical dataset (case 2) (Fig. 4b), Tpy and C showed a weak positive correlation with Qe, whereas BET, PV, CTo and Co exhibited significant positive correlations with Qe.
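A single entry of such a correlation matrix can be computed as in the sketch below, which implements the standard Pearson coefficient in plain Python. The BET and Qe values are hypothetical and are not taken from the study's dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical values: BET surface area (m2/g) and Qe (mg/g)
bet = [50.0, 120.0, 300.0, 450.0]
qe = [20.0, 60.0, 150.0, 210.0]
print(round(pearson(bet, qe), 3))
```

A coefficient near +1 or -1 indicates a strong linear association, whereas a value near 0 indicates little linear association; it does not by itself establish a causal influence.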

Fig. 4 Pearson correlation matrix for (a) the Ideal dataset (case 1) and (b) the Practical dataset (case 2), depicting the relationship between the input and output features of TC adsorption using BC for achieving maximum equilibrium sorption capacity

The BCs considered in the study were rich in carbon, with contents ranging from approximately 40% to 90% in case 1 and from approximately 30% to 92% in case 2. It is well established that the carbon content increases with increasing pyrolytic temperature, as carbon accumulates during the thermochemical processing of biomass feedstocks. The ratios from the ultimate analyses, namely [(N + O)/C], (H/C) and (O/C), indicate the polarity, aromaticity, and hydrophilicity of BC, respectively. A higher [(N + O)/C] ratio represents higher polarity, whereas higher (H/C) and (O/C) ratios signify lower aromaticity and higher hydrophilicity, respectively. Further, BC characteristics such as BET surface area, pore structure and pore volume were also considered in the RSML study.

Our earlier study revealed that the initial TC concentration, BET surface area and pore volume had significant positive correlations with the adsorption capacity of TC on BC (Zhang et al. 2023). This is because, for a constant amount of adsorbent, the amount adsorbed per unit mass of adsorbent increases as the adsorbate dosage rises. Likewise, for a constant amount of adsorbent (BC), an increase in BET surface area can lead to higher adsorption of the adsorbate (TC) per unit mass of adsorbent. A larger pore volume means more pore space per unit of adsorbent, which results in a higher adsorption capacity.

3.2 Assessment of cores and reducts

The cores and reducts among the fifteen conditional attributes related to the highest TC adsorption capacity on BC (decision attribute) were derived through the RSML study. Two cases, an Ideal dataset (n = 94) and a Practical dataset (n = 295), were considered. The sample dataset and the information table for studying TC adsorption on biochar are shown in Tables S3 and S4 for case 1 and Tables S5 and S6 for case 2. With the established decision system for TC adsorption on BC with the highest capacity, 4 cores and 15 reducts were identified by RST using ROSE2 software for case 1, whereas 6 cores and only 7 reducts were generated for case 2. The core attributes common to both cases, namely pH of the aqueous solution, initial TC concentration, BC dosage, and initial concentration ratio of TC to BC, formed the most decisive subset of attributes in the decision table. In other words, pH Solution, CTo, Cchar and Co cannot be excluded from the decision system without affecting the classification power for the adsorption capacity. Notably, when the datasets were incomplete, as in case 2, the decision system required two additional core attributes, namely pyrolysis temperature and adsorption temperature, in addition to the existing four. Table 1 lists the number of cores and reducts generated along with the respective number of generated rules for TC adsorption on BC. The total number of rules for a case cannot be obtained by summing the rules under each reduct, as the rules may overlap across reducts. For instance, in case 1 of the Ideal dataset, 15 reducts were induced and approximately 36 rules were generated; summing the rules under the 15 reducts would therefore give a number higher than the total number of generated rules.

Table 1 Cores, reducts and the number of rules generated through RSML for the two different datasets of TC adsorption on BC

3.3 Rules generated for TC adsorption by BC

RST induced 36 rules from all 15 reducts for case 1 of the Ideal dataset, of which 6 rules were classified into the class 1 decision of Qe greater than 200 mg/g (Table 2). Likewise, 2, 13 and 11 rules were generated for the class 2, 4 and 5 decisions of Qe, respectively. No rules could be generated for the class 3 decision as it contained only one dataset. Further, two approximate rules were generated for the decision attributes of Qe other than class 1. RSML was able to classify between classes 1 to 5 with relatively high certainty and coverage. All the generated rules, along with their respective relative strength, coverage factor, and certainty, are listed in Supplementary Table S5.

Table 2 List of rules generated for TC adsorption on BC using RSML algorithms for class 1 decision (Qe > 200 mg/g) using ideal and practical dataset

As a rule of thumb, rules with high certainty and coverage should be selected for further consideration. Notably, all the rules generated in both cases had 100% certainty, which indicates a well-trained model. Figure 5 presents the strength and coverage of all the generated rules in both cases. Further, the small size of the dataset used for model development might weaken the classification power of the generated rules.

Fig. 5 Plot of the strength (number of data) of the rules against their coverage for the rules generated for TC adsorption on BC: (a) Ideal dataset (case 1) and (b) Practical dataset (case 2)

For instance, for class 1 under the Ideal dataset, rules 2, 3, 5 and 6 reveal that the pH range of 3–7 was more suitable for enhanced TC adsorption, which is corroborated by the study of Zhang et al. (2019) covering different pyrolytic conditions and pH ranges. Further, their experiments revealed that biochar produced at higher temperatures exhibited the maximal adsorption capacity for TC at acidic pH due to the deprotonation of the BC surface, leading to electrostatic interactions between BC and TC.

In the RSML analysis for case 2 of the Practical dataset, 92 rules were generated, including six approximate rules. Of these, 15 deterministic rules were generated for class 1, and 15, 9, 26 and 21 deterministic rules were created for classes 2, 3, 4 and 5, respectively, as shown in Table S6. The higher number of rules might be because case 2 contained about three times as many data as case 1, even though it had only about half as many reducts.

For instance, considering rule 1 of case 2 (Practical dataset) from Table 2, the predicted process conditions for achieving Qe > 200 mg/g involve a biochar property [O/C ≥ 1], a solution pH of 6 for the adsorption reaction, and a TC to BC ratio (C0) greater than 2. This rule provides one possible set of conditions for achieving Qe > 200 mg/g using 3 main parameters rather than all 15 parameters. Of these 3 parameters, the pH of the solution is crucial for adsorption to occur at a higher rate, and the TC to BC ratio is required to ensure the presence of enough adsorbent (BC) for capturing TC.
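Because RSML rules are if–then statements, they can be applied to new samples directly. The sketch below encodes the three conditions quoted above for rule 1 of case 2; the dictionary keys and the example samples are hypothetical, and treating a missing attribute as "rule does not fire" is an assumption made here.

```python
def rule1_case2(sample):
    """True when the sample meets all three conditions: O/C >= 1, solution pH = 6, C0 > 2."""
    o_c = sample.get("O/C")
    ph = sample.get("pH_solution")
    c0 = sample.get("C0")
    if o_c is None or ph is None or c0 is None:
        return False  # assumption: the rule cannot fire when a required attribute is missing
    return o_c >= 1 and ph == 6 and c0 > 2


# Hypothetical samples
candidates = [
    {"O/C": 1.2, "pH_solution": 6, "C0": 2.5},   # satisfies the rule
    {"O/C": 1.2, "pH_solution": 6, "C0": 1.5},   # TC to BC ratio too low
    {"O/C": None, "pH_solution": 6, "C0": 3.0},  # O/C not reported
]
for sample in candidates:
    label = "Qe > 200 mg/g predicted" if rule1_case2(sample) else "rule not applicable"
    print(sample, "->", label)
```

A sample that does not satisfy this rule is not necessarily a low-capacity case; it may still be covered by one of the other class 1 rules.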

3.4 Cross-validation of the generated rules for TC adsorption by BC

The confusion matrix depicts the classification efficiency for each class in both cases; high values on the diagonal of the matrix indicate correct classification. However, for case 1, classes 2 and 3 had zero values on the diagonal, meaning that they had not been correctly classified. This might be due to the very small number of datasets in those classes. In contrast, the classification based on the trained model in case 2 was more satisfactory than that of case 1 for all classes (as seen in Table 3), which might be due to the higher number of datasets.

Table 3 Confusion matrix for the two different datasets of TC adsorption on BC using RSML algorithm

Evaluating the performance of machine learning models is challenging with limited data, and k-fold cross-validation is well suited to research using sparse datasets. The current study employed 10-fold cross-validation to evaluate model performance: the datasets were randomly split into k subsets, with k − 1 subsets used for training and the remaining subset used for testing. This process was repeated 10 times so that each subset served as the test set once. The classification model was built using the training set and validated on the test set. Model efficiency was quantified based on true positives, true negatives, false positives and false negatives. Precision and recall are statistical measures used for validation. Precision refers to the proportion of correctly predicted positive observations among all predicted positive observations, while recall refers to the proportion of actual positives that are correctly identified. The F1 score is another statistical measure used to evaluate the developed RSML model. Table 4 presents the cross-validation metrics of the RSML algorithm for both cases. For class 1 (Qe > 200 mg/g), the accuracy was 89.25% for case 1 and 93.22% for case 2. Likewise, precision was slightly higher in case 2 than in case 1, which signifies that even with the Practical dataset, the trained model could correctly predict most positive observations without any imputation. However, recall and F1 score declined in case 2, which might be due to the large amount of missing data in the Practical dataset. The remaining classes can be interpreted similarly from Table 4. As mentioned earlier, only two rules were generated for class 2 and none for class 3 in case 1 because very few datasets fell into these classes; accordingly, the precision, recall, and F1 score were all zero for both classes.
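The splitting scheme behind k-fold cross-validation can be sketched with the standard library alone. This is a generic illustration and not the validation routine used in the study; the function name and the fixed random seed are hypothetical choices.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # random assignment of samples to folds
    folds = [indices[i::k] for i in range(k)]      # k nearly equal, disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 295 records, as in the Practical dataset, split into 10 folds
splits = list(k_fold_indices(295, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Because the training and test indices of each split are disjoint, no record is used both to induce and to validate the rules within the same fold.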

Table 4 Performance evaluation of RSML algorithms for the two different datasets of TC adsorption on BC

3.5 Comparative evaluation of RSML algorithm with the other classifier models for TC adsorption by BC

A wide array of ML algorithms was evaluated to identify the model that yields the best performance in the given classification task. Several commonly used classification algorithms from pycaret (Extra Trees Classifier, Gradient Boosting Classifier, Random Forest Classifier, Extreme Gradient Boosting, Decision Tree Classifier, Light Gradient Boosting Machine, K Neighbors Classifier, Logistic Regression, Linear Discriminant Analysis, Ridge Classifier, Naive Bayes, Quadratic Discriminant Analysis, AdaBoost Classifier, and SVM - Linear Kernel) were compared with RSML for classifying TC adsorption on biochar. The comparison was carried out on the case 2 Practical dataset because the model trained on it performed better than the one trained on the case 1 Ideal dataset. Table 5 summarizes the performance metrics of the RSML algorithm along with the other classifier algorithms. Because the Practical dataset was divided into classes using Qe thresholds predefined by the user, the resulting classes were imbalanced. This explains why the precision, recall, and F1-score values were low relative to accuracy. Nevertheless, the developed RSML model exhibited better accuracy than the other classifier models, which may be because RSML relies on rough set theory to deduce better approximations. Herein, the RSML model performed well with an accuracy of 88.34%, followed by the extra trees classifier (81.05%) and gradient boosting classifier (80.60%).
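The effect of class imbalance on these metrics can be illustrated with a small, self-contained sketch. The labels below are hypothetical and are not the study's data, and macro-averaging of the F1-score is an assumption made for the example, since the averaging scheme behind Table 5 is not stated here.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores, so every class counts equally."""
    scores = []
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced labels: one dominant class and two rare classes.
y_true = ["low"] * 90 + ["mid"] * 5 + ["high"] * 5
# A classifier that always predicts the dominant class.
y_pred = ["low"] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

For these made-up labels the accuracy is 0.90 while the macro F1-score is about 0.32, showing how accuracy can stay high on an imbalanced dataset even when the minority classes are never identified.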

Table 5 Comparative evaluation of the RSML algorithm with other classifier models for the Practical dataset of TC adsorption on BC

3.6 Comparative analysis of the optimized range of conditions needed for maximizing Qe of TC on BC

Table 6 summarizes the rules selected to obtain the maximum TC adsorption capacity (more than 200 mg/g) on BC. According to the RSML model trained on the Ideal dataset, a pyrolysis temperature of 300 °C, combined with ultimate-analysis characteristics of carbon content in BC greater than 87%, O/C between 0.1 and 0.19, and (O + N)/C between 0.1 and 0.2, and a solution pH between 3 and 7, would result in a TC adsorption capacity (Qe) > 200 mg/g on BC. In case 2, where the model was trained on the Practical dataset, the RSML additionally required BC properties (such as pore volume, pore size, and surface area) to predict Qe > 200 mg/g of TC on BC.
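The case 1 rule stated above can be written as a simple predicate. This is only a sketch: the dictionary keys (`"Tpy"`, `"C"`, `"O/C"`, `"(O+N)/C"`, `"pH"`) are hypothetical names, and treating the range boundaries as inclusive is an assumption, since the text does not specify whether the end points are included.

```python
def matches_case1_rule(sample):
    """Return True if a BC sample meets the case 1 conditions for Qe > 200 mg/g.

    Keys are hypothetical; ranges are assumed to include their end points.
    """
    return (
        sample["Tpy"] == 300                    # pyrolysis temperature, deg C
        and sample["C"] > 87                    # carbon content in BC, %
        and 0.1 <= sample["O/C"] <= 0.19        # O/C ratio
        and 0.1 <= sample["(O+N)/C"] <= 0.2     # (O + N)/C ratio
        and 3 <= sample["pH"] <= 7              # solution pH
    )


# Hypothetical sample used only to exercise the rule.
example = {"Tpy": 300, "C": 88.5, "O/C": 0.12, "(O+N)/C": 0.15, "pH": 5}
print(matches_case1_rule(example))
```

Expressing a rule this way makes the interpretability of the rule-based approach concrete: each condition attribute appears as one readable comparison.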

Table 6 Optimal range of condition attributes for Class 1 TC adsorption capacity of Qe > 200 mg/g on BC

It is noteworthy that researchers aim to produce BC with the attributes needed to maximize the adsorption of target pollutants, which is why the discussion focused on the interpretation of class 1 in both cases. Apart from the accurate rules, RSML offers approximate rules that might classify the output into either of the classes. In case 1, two approximate rules were generated, and both exhibited 100% coverage and certainty. In case 2, seven approximate rules were generated, with coverage ranging from 14.2 to 100% and a certainty of 100%. The results indicate that RST can effectively select relevant attributes that improve predictive performance. The feature selection method of RST can eliminate irrelevant data, simplifying the decision-making process. This likely explains why rough set modeling yielded satisfactory prediction results in this study.
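Coverage and certainty of a decision rule can be computed as in the sketch below, which uses the standard rough set definitions: certainty is the share of records matching the rule's conditions that belong to the decision class, and coverage is the share of records in the decision class that match the conditions. The function name `rule_quality`, the record layout, and the toy records are hypothetical, and the software used in the study may compute these quantities differently.

```python
def rule_quality(records, condition, decision_class, label_key="class"):
    """Return (certainty, coverage) of a rule 'IF condition THEN decision_class'."""
    matched = [r for r in records if condition(r)]
    in_class = [r for r in records if r[label_key] == decision_class]
    support = sum(r[label_key] == decision_class for r in matched)
    certainty = support / len(matched) if matched else 0.0
    coverage = support / len(in_class) if in_class else 0.0
    return certainty, coverage


# Toy records (not the study's data): solution pH and adsorption class.
records = [
    {"pH": 5, "class": 1},
    {"pH": 6, "class": 1},
    {"pH": 8, "class": 1},
    {"pH": 9, "class": 1},
    {"pH": 4, "class": 2},
    {"pH": 10, "class": 2},
]

certainty, coverage = rule_quality(records, lambda r: r["pH"] <= 7, decision_class=1)
print(f"certainty = {certainty:.2f}, coverage = {coverage:.2f}")
```

A rule with 100% certainty never misclassifies the records it matches, while its coverage indicates how much of the target class it accounts for.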

4 Conclusion

Using an Ideal and a Practical dataset, two rule-based RSML models were developed to estimate TC adsorption capacity on BC. Both models produced scientifically coherent decision rules; however, the model trained on the Practical dataset performed better. The model trained on the Ideal dataset suggested that Tpy, C, O/C, (O + N)/C, and solution pH were essential for Qe > 200 mg/g, whereas the model developed with the Practical dataset required BC properties in addition to these attributes to achieve the same purpose. The model trained on the Practical dataset indicated that a pyrolysis temperature of 300 °C with a TC-to-BC ratio between 1 and 2 is required to achieve an adsorption capacity greater than 200 mg/g. This study demonstrated an interpretable RSML tool that estimates BC adsorption capacity from a Practical dataset without imputation, thus minimizing bias and variance during decision-making.