Introduction

Structure–activity relationship (SAR) modeling has been frequently used to predict the biological activities of chemicals from their molecular structures. One of the major challenges in SAR-based chemical classification or drug discovery is the extreme imbalance between active and inactive chemicals [1]. Despite the existence of as many as 10⁷ commercially available molecules [2], there is almost always a skew in the distribution of molecules across the bioactivity landscape or toxicity classes. Biomacromolecules such as proteins are often highly selective in their binding to small-molecule ligands. Across this huge chemical space, only a few compounds are likely to interact with a target biomacromolecule and cause biological effects; these are labeled as active compounds, whereas the remaining majority are labeled as inactive compounds. This gives rise to the common problem of class imbalance in SAR-based predictive modeling, particularly in chemical classification and activity quantification using machine learning approaches [3,4,5].

In machine learning, classifiers are built on data statistics and require a balanced data distribution to achieve optimal performance. Classifiers trained from imbalanced data tend to have a bias towards the majority class. This leads to low sensitivity and precision for the minority class [6], even though the minority class is usually of greater importance than the majority class [7, 8]. In fields such as toxicology and disease diagnosis, bias towards the majority class may result in a higher rate of false negative predictions [1].

The problem of data imbalance has been studied in the context of machine learning-based SAR modeling for more than two decades [7, 9, 10]. As a result, a plethora of methods have been proposed to alleviate the skewness of the class distribution. These methods can be grouped into three categories: data-level, algorithm-level, and hybrid [7, 11]. Data-level methods aim to rebalance the training dataset’s class distribution either by undersampling the majority class or oversampling the minority class [12, 13]. They also include methods that clean overlapping samples and remove noisy samples that may negatively affect classifiers [13, 14]. Algorithm-level methods attempt to alter a given learning algorithm by inducing cost sensitivity that biases a model towards the minority class, for example, by imposing a high cost on misclassifying the minority class [7, 11]. Recently, Mondrian conformal prediction (MCP) has been applied to improve machine learning from imbalanced datasets by computing nonconformity scores that model the reliability of predictions, allowing reliable predictions to be identified at user-defined significance and confidence levels [15,16,17,18,19]. The MCP approach does not require data rebalancing. Hybrid methods combine resampling strategies with special-purpose learning algorithms [11]. Ensemble approaches (e.g., bagging and boosting), known to increase the accuracy of single classifiers, have also been hybridized with resampling strategies [6].

The selection of appropriate metrics plays a key role in evaluating the performance of imbalanced learning algorithms [11, 20]. In consideration of user preference (e.g., identifying rare active chemicals) and data distribution, a number of metrics have been proposed, including precision, recall, the Area Under the Precision-Recall Curve (AUPRC) [21], the Area Under the Receiver Operating Characteristic curve (AUROC) [22], F-measure, geometric mean (G-mean), and balanced accuracy [23,24,25,26]. For instance, precision is not affected by a large number of negative samples because it measures the fraction of true positives among the samples predicted as positive (i.e., true positives + false positives). A high AUPRC reflects both high recall and high precision; high precision relates to a low false positive rate, and high recall relates to a low false negative rate [21, 27].

The present study was motivated by the scarcity of reported efforts in the application of the above-mentioned methods to the SAR-based chemical classification domain. Our literature survey identified only a few studies in this domain where cost-sensitive learning [28, 29], resampling [29, 30], conformal prediction [18] and extreme entropy machines [1, 31] were employed to specifically deal with data imbalance. Although predictive modeling was improved for certain datasets, a consistent performance enhancement was not observed as a result of resampling and algorithm modification. More studies are therefore warranted to further examine questions such as: (1) Does the imbalance ratio (IR), i.e., the inactive-to-active sample ratio, affect the effectiveness of data-level methods (particularly resampling methods)? (2) Would different data rebalancing techniques affect the performance of a classifier differently, and does the combination of undersampling and oversampling techniques, such as SMOTEENN (SMOTE + ENN) [32], outperform an undersampling or oversampling technique alone? (3) What metrics can better evaluate the results of imbalanced learning in SAR-based chemical classification? This study attempted to address all three of these questions.

To address the first question, we selected twelve binary datasets of 10 K compounds with varying degrees of imbalance, which were generated within the Toxicology in the 21st century (Tox21) program [33] and used for the Tox21 Data Challenge 2014 [34, 35] (https://tripod.nih.gov/tox21/challenge/about.jsp). To address the other two questions, we chose nine evaluation metrics, compared three resampling algorithms integrated with the base classifier (random forest—RF), and performed statistical analysis to rank the metrics.

In this work, we selected RF as the base classifier and bagging as the ensemble learning algorithm to improve the stability and accuracy of model predictions. Then, we applied three representative resampling methods for data imbalance handling, i.e., random under-sampling (RUS), the synthetic minority over-sampling technique (SMOTE) and SMOTEENN (i.e., a combination of SMOTE and Edited Nearest Neighbor (ENN) algorithms). Consequently, four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with RUS (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN) were tested. Here, we did not intend to conduct a comprehensive or exhaustive comparative investigation of all existing imbalance handling methods, but rather to use this case study to demonstrate that appropriate handling of imbalanced data and the choice of appropriate evaluation metrics could improve SAR-based classification modelling. We also investigated the performance of these existing approaches and highlighted their limitations regarding imbalance ratio. The rest of the paper is organized as follows: “Materials and methods” section covers the study design, data curation and preprocessing steps, imbalance handling methods, and performance metrics. “Results and discussion” section presents our classification performance results, statistical analysis, and a comparison with published results for the Tox21 datasets. Lastly, “Conclusions” section briefly summarizes the major findings from this study and concludes with some remarks on future research needs.

Materials and methods

Study design

The workflow of our study design is outlined in Fig. 1. It consists of data preprocessing, feature generation and selection, resampling, model training (ensemble learning), model testing and performance evaluation. The data preprocessing and feature generation steps were applied to all 12,707 compounds in the raw dataset of 12 assays, whereas feature selection, resampling and classifier training were conducted separately for each individual assay. For each assay, N stratified bootstrap samples were drawn with replacement from the preprocessed training set (i.e., samples were randomly selected while retaining the original imbalance ratio). This was followed by ensemble learning either without resampling (RF) or with the application of a resampling technique (RUS, SMOTE, or SMOTEENN). Optimal parameters for each base learner were obtained via grid search with fivefold cross-validation, and the optimized base learners were combined to form the final ensemble learner. Evaluation metrics were calculated from the prediction results of RF, RUS, SMO and SMN to statistically compare their performance. Details of the workflow are presented below.
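The stratified bootstrap step can be sketched as follows with scikit-learn's resample utility; X_train and y_train are assumed to hold one assay's preprocessed feature matrix and activity labels, and the function name is illustrative rather than taken from the study's code.

```python
import numpy as np
from sklearn.utils import resample

def stratified_bootstraps(X_train, y_train, n_bootstraps=30, seed=0):
    """Draw bootstrap samples with replacement that preserve the original
    inactive-to-active ratio (imbalance ratio) of the training set."""
    rng = np.random.RandomState(seed)
    for _ in range(n_bootstraps):
        X_b, y_b = resample(
            X_train, y_train,
            replace=True,                  # sample with replacement
            n_samples=len(y_train),        # same size as the training set
            stratify=y_train,              # keep the class proportions unchanged
            random_state=rng.randint(1_000_000),
        )
        yield X_b, y_b
```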

Fig. 1

Workflow of structure–activity relationship (SAR)-based chemical classification with imbalanced data processing designed for this study

Chemical in vitro toxicity data curation

The Tox21 Data Challenge dataset used in this study consisted of 12 quantitative high throughput screening (qHTS) assays for a collection of over 10 K compounds (with redundancy within and across assays). The 12 in vitro assays included a nuclear receptor (NR) signaling panel and a stress response (SR) panel. The NR panel comprised 7 qHTS assays for identifying compounds that either inhibited aromatase or activated the androgen receptor (AR), aryl hydrocarbon receptor (AhR), estrogen receptor (ER), or peroxisome proliferator-activated receptor γ (PPAR-γ). The SR panel contained 5 qHTS assays for detecting agonists of the antioxidant response element (ARE), heat shock factor response element (HSE) or p53 signaling pathways, disruptors of the mitochondrial membrane potential (MMP), or genotoxicity inducers in human embryonic kidney cells expressing luciferase-tagged ATAD5. There were three sets of chemicals: a training set of 11,764 chemicals, a leaderboard set of 296 chemicals and a test set of 647 chemicals [35]. For this study, we merged the leaderboard set with the original training set to form our “training set” and retained the original test set as our “test set”. The Tox21 dataset was downloaded in SDF format from https://tripod.nih.gov/tox21/challenge/data.jsp. There were four possible assay outcomes for each compound: active, inactive, inconclusive or not tested. Only chemicals labeled as either active (1) or inactive (0) were retained for this study.
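As a hedged illustration of this filtering and merging, the snippet below assumes the assay outcomes have already been parsed into pandas DataFrames with a hypothetical "outcome" column; the actual Tox21 files encode outcomes differently.

```python
import pandas as pd

def keep_binary_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Retain only chemicals labeled active (1) or inactive (0) for one assay."""
    df = df[df["outcome"].isin(["active", "inactive"])].copy()
    df["label"] = (df["outcome"] == "active").astype(int)
    return df

# Merge the leaderboard set into the training set; the original test set is kept separate.
train = keep_binary_labels(pd.concat([train_raw, leaderboard_raw], ignore_index=True))
test = keep_binary_labels(test_raw)
```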

Compound preprocessing and chemical descriptor (feature) generation

Chemical structures were also downloaded from https://tripod.nih.gov/tox21/challenge/data.jsp as SMILES files. Data standardization/cleaning was carried out using MolVS [36], a publicly available tool built on RDKit [37]. Standardization involved a fragmentation step as described in [25], where compounds possessing distinct structures not linked by covalent bonds were split into separate “compound fragments”. Then, solvent fragments, salts and problematic molecules with inconsistent resonance structures and tautomers [38], which should not contribute to the biological effect of a compound [39], were removed. The resulting SMILES entries were canonicalized by standardizing chemotypes such as nitro groups and aromatic rings, and the largest uncharged fragment of each compound was retained. After standardization, the resulting fragments were merged based on their reported activity to exclude replicates and conflicting instances. Specifically, only one instance of a set of duplicates was retained with the most frequent activity label, while duplicates with ambiguous activity labels (i.e., an equal number of active and inactive outcomes for the same chemical) were removed. Three types of molecular features (> 2000 in total), i.e., RDKit descriptors, MACCS (Molecular ACCess System) keys and Extended-Connectivity Fingerprints (ECFPs) [40] with a radius of 2 and a fixed bit length of 1024, were generated using RDKit [37] to characterize the final set of compounds. All features with zero variance were dropped.
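The feature generation step can be sketched as follows with RDKit and scikit-learn, assuming a list of standardized SMILES strings; this mirrors the three feature types and the zero-variance filter described above but is not the authors' exact pipeline.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, MACCSkeys
from sklearn.feature_selection import VarianceThreshold

def featurize(smiles_list):
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        desc = [fn(mol) for _, fn in Descriptors.descList]        # RDKit descriptors
        maccs = list(MACCSkeys.GenMACCSKeys(mol))                 # 167-bit MACCS keys
        ecfp = list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024))  # ECFP, radius 2
        rows.append(desc + maccs + ecfp)
    X = np.array(rows, dtype=float)
    return VarianceThreshold(threshold=0.0).fit_transform(X)      # drop zero-variance features
```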

Sampling and classification methods

Here we briefly describe the three resampling techniques (i.e., RUS, SMOTE and SMOTEENN) that we used for handling imbalanced data with RF chosen as the base classifier.

RUS

RUS is a widely used undersampling technique which randomly removes samples from the majority class. In our study, RUS was used to randomly remove inactive compounds. While RUS alleviates imbalance in the dataset, it may potentially discard useful or important samples and increase the variance of the classifier. Recent studies have shown that the integration of RUS with ensemble learning can achieve better results [6, 41]. To overcome its drawbacks, we combined RUS with bagging (an ensemble learning algorithm) for SAR-based chemical classification.
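In practice, RUS amounts to a one-liner with the Imbalanced-learn package used in this study; X and y are assumed to be one assay's feature matrix and labels.

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)    # randomly drops majority-class (inactive) samples
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))      # class counts before and after undersampling
```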

SMOTE

SMOTE is an oversampling technique that creates synthetic samples based on feature space similarities between existing examples in the minority class [12]. It has shown a great deal of success in various applications [20]. To create a synthetic sample, we first took a sample from the minority class and considered its k-nearest neighbors based on Euclidean distance, forming a vector between the current data point and one of those k neighbors. The new synthetic sample was obtained by multiplying this vector by a random number α between 0 and 1 and adding the product to the current data point. More technical details on how synthetic samples are created are given in Additional file 1: Figure S1 and in [12, 20]. Applying SMOTE to the minority class instances can balance class distributions [12] and augment the original dataset in a manner that generally improves learning [20].
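The interpolation step can be written out as a toy function, followed by the Imbalanced-learn call typically used in practice; X, y and the helper name are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def smote_one_sample(x, neighbor, rng=np.random):
    """Toy version of the SMOTE interpolation: x + alpha * (neighbor - x)."""
    alpha = rng.uniform(0.0, 1.0)            # random number between 0 and 1
    return x + alpha * (neighbor - x)        # point on the segment joining x and its neighbor

# Library call: oversample the minority class using the 5 nearest neighbors of each sample.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
```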

SMOTEENN

Despite its many promising benefits, the SMOTE algorithm also has drawbacks, including over-generalization and increased variance [20]. In many cases, class boundaries are not well defined, since some synthetic minority class instances may cross over into the majority class space, especially for nonlinear data with a large feature space [42]. As a result, some new synthetic samples in the minority class may be mislabeled, and attempting to learn from such datasets often results in a higher false prediction rate [43]. To remove the mislabeled samples created by SMOTE, we applied SMOTEENN [32], a combination of SMOTE and the Edited Nearest Neighbor (ENN) [44] algorithm, to clean the synthetic data samples.

In the ENN algorithm, the label of every synthetic instance is compared with the vote of its k-nearest neighbors. The instance is removed if it is inconsistent with its k-nearest neighbors; otherwise, it remains in the dataset. The process of removing mislabeled samples and retaining the valid synthetic instances is illustrated in the Additional file 1: Figure S1c. A higher k value in the edited nearest neighbors algorithm leads to a more stringent cleaning rule that allows more synthetic instances to be eliminated. Applying SMOTEENN to an imbalanced dataset does not automatically result in a perfectly balanced set after resampling, but it creates more meaningful synthetic samples in the minority class and reduces the imbalance ratio to a more manageable level.
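A hedged sketch of the combined resampler with Imbalanced-learn is shown below; k = 3 matches the ENN setting optimized later in this study, while the remaining settings are package defaults rather than the authors' exact configuration.

```python
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

smn = SMOTEENN(
    smote=SMOTE(k_neighbors=5, random_state=42),
    enn=EditedNearestNeighbours(n_neighbors=3),  # drop instances inconsistent with their 3 nearest neighbors
    random_state=42,
)
X_res, y_res = smn.fit_resample(X, y)            # oversample, then clean mislabeled synthetic samples
```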

RF and ensemble learning

RF is a robust supervised learning algorithm that has been widely used for classification in many applications in data science [45]. An RF model consists of many individual decision trees that operate as an ensemble. The individual decision trees are generated using a random selection of features at each node to determine the split. During classification, each tree votes and the class with most votes becomes the model’s prediction.

RF can be built [46] and improved [47] using bagging (short for bootstrap aggregation). Bagging is a common ensemble method that uses bootstrap sampling in which several base classifiers are combined (usually by averaging) to form a more stable aggregate classifier [48]. Each base classifier (RF in this study) in the ensemble is trained on a different subset of the training dataset obtained by random selection with replacement, thus introducing some level of diversity and robustness. It is well known that the bagging classifier is more robust in overcoming the effects of noisy data and overfitting, and it often has greater accuracy than a single classifier because the ensemble model reduces the effect of the variance of individual classifiers [6, 48, 49].
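For reference, scikit-learn's BaggingClassifier implements this bootstrap-and-aggregate scheme directly; the sketch below (with assumed X_train and y_train) is illustrative only, since the study builds its ensemble manually so that a resampling step can be inserted before each base learner (see Algorithm 1).

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

bagged_rf = BaggingClassifier(
    estimator=RandomForestClassifier(),   # named base_estimator in scikit-learn < 1.2
    n_estimators=30,                      # number of bootstrap-trained forests
    bootstrap=True,                       # draw each training subset with replacement
    n_jobs=-1,
)
bagged_rf.fit(X_train, y_train)
```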

In our case, the Tox21 dataset was both high-dimensional and highly imbalanced [6, 50]. For datasets with such a large feature space and a small number of minority class samples, classification often suffers from overfitting. Because bagging is less susceptible to model overfitting, we chose it as the ensemble method. Combining the base classifier RF with three sampling techniques (RUS, SMOTE and SMOTEENN) and bagging, we assembled four hybrid classification methods: (1) RF without resampling, (2) RF + RUS, (3) RF + SMOTE, and (4) RF + SMOTEENN. For brevity in the result analysis, the four methods are denoted as RF, RUS, SMO and SMN, respectively.

Here we use SMN as an example to illustrate the algorithm that integrates resampling with ensemble learning (see Algorithm 1 and Fig. 1). First, a subset, \({S}_{i}\), was obtained by stratified bootstrap sampling from the training set, \(X\). This sampling process was repeated N times, where i = 1 to N, with N ranging between 5 and 100 in steps of 5. Stratification was employed to ensure that each bootstrap sample had the same class distribution as the entire training set. Each subset was used to train one classifier in the ensemble; hence N is also the number of classifiers. Then, the SMOTEENN algorithm was applied to \({S}_{i}\) to oversample the minority class and clean the result, yielding an augmented training subset \(S^{\prime}_{i}\), which was used to train a random forest classifier \({f}_{i}\left(x\right)\). The parameters of each classifier in the ensemble were selected using grid search with fivefold cross-validation, giving every individual classifier a chance to attain its best performance and contribute optimally to the ensemble. The final ensemble model was a bagged classifier that counted the votes of the N classifiers and assigned the class with the most votes to each chemical in the test dataset. The other three methods, RF, RUS and SMO, also employed Algorithm 1, with the only difference being the resampling technique, i.e., no resampling, RUS and SMOTE, respectively. All classifiers were implemented using the Scikit-learn [51] and Imbalanced-learn [52] Python packages.

Algorithm 1
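A hedged Python sketch of Algorithm 1 for the SMN variant is given below. The parameter grid is illustrative (the exact grid searched is not listed here), and replacing the SMOTEENN call with RandomUnderSampler, SMOTE, or nothing yields the RUS, SMO and RF variants.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample
from imblearn.combine import SMOTEENN

def train_smn_ensemble(X, y, n_classifiers=30, seed=0):
    rng = np.random.RandomState(seed)
    param_grid = {"n_estimators": [100, 300], "max_features": ["sqrt", "log2"]}  # assumed grid
    ensemble = []
    for _ in range(n_classifiers):
        # Stratified bootstrap: same size and class ratio as the full training set
        X_i, y_i = resample(X, y, replace=True, stratify=y,
                            random_state=rng.randint(1_000_000))
        # Rebalance the bootstrap subset (SMOTE oversampling followed by ENN cleaning)
        X_i, y_i = SMOTEENN(random_state=0).fit_resample(X_i, y_i)
        # Tune each base learner with fivefold cross-validation
        search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring="f1")
        search.fit(X_i, y_i)
        ensemble.append(search.best_estimator_)
    return ensemble

def predict_majority(ensemble, X_test):
    votes = np.stack([clf.predict(X_test) for clf in ensemble])   # shape (N, n_test_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)                # class with the most votes
```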

Performance evaluation metrics

The output of a binary classification model can be summarized by four terms: (1) true positives (TP), the number of truly active chemicals correctly predicted as active by the model; (2) false positives (FP), the number of truly inactive chemicals incorrectly predicted as active; (3) true negatives (TN), the number of truly inactive chemicals correctly predicted as inactive; and (4) false negatives (FN), the number of truly active chemicals incorrectly predicted as inactive. Most evaluation metrics are derived from these four terms. The true positive rate (TPR), also referred to as sensitivity or recall, represents the fraction of active chemicals that are correctly predicted; in SAR modeling, recall is also considered a measure of the accuracy of the active (minority) class. The true negative rate (TNR), or specificity, provides the corresponding measure for the inactive (majority) class. Precision estimates the probability that a chemical predicted as active is truly active. The F1 score is the harmonic mean of precision and recall, and balanced accuracy (BA) is the average of the per-class accuracies (sensitivity and specificity). The Matthews correlation coefficient (MCC) offers a good index of performance for imbalanced classification tasks because it incorporates all components of the confusion matrix [53], and it has been widely used to evaluate the performance of SAR-based chemical classification [34, 54]. The MCC value ranges over [−1, 1], with −1 implying complete disagreement, 1 complete agreement and 0 no correlation between the predictions and the known truth. The Brier score measures the average squared difference between the predicted class probabilities and the observed outcomes, and it assesses the overall accuracy of a probabilistic model. The formulas of these evaluation metrics are given as follows:

$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{F}_1\,\text{score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$\text{Balanced accuracy (BA)} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
$$\text{Brier score} = \frac{1}{N}\sum_{i=1}^{N} \left( p_i - o_i \right)^2$$

where N is the total number of chemicals in a dataset, \(p_i \in [0, 1]\) is the predicted probability, and \(o_i\) is the ground truth for the ith chemical (1 for active and 0 for inactive).

In addition, the two widely used metrics AUROC and AUPRC were also calculated using Scikit-learn [51] to evaluate and compare the overall performance of a classifier against another. Finally, sensitivity–specificity gap (SSG), calculated as the absolute value of the difference between sensitivity and specificity, was introduced as a metric to evaluate how balanced a classifier was in terms of its performance on these two metrics [13].
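For concreteness, the snippet below shows how this metrics panel can be computed with scikit-learn; y_true, y_pred and y_prob are assumed arrays of known labels, predicted classes and predicted active-class probabilities, and average_precision_score is used here as the precision-recall summary (AUPRC).

```python
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             brier_score_loss, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score, recall_score,
                             roc_auc_score)

def evaluate(y_true, y_pred, y_prob):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = recall_score(y_true, y_pred)       # TP / (TP + FN)
    specificity = tn / (tn + fp)                     # TN / (TN + FP)
    return {
        "recall": sensitivity,
        "precision": precision_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "BA": balanced_accuracy_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Brier": brier_score_loss(y_true, y_prob),
        "AUROC": roc_auc_score(y_true, y_prob),
        "AUPRC": average_precision_score(y_true, y_prob),
        "SSG": abs(sensitivity - specificity),       # sensitivity-specificity gap
    }
```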

We performed statistical analysis to assess whether there were significant differences among the four investigated classification methods in terms of their performance metrics across the twelve bioassays (Table 1). We adopted a nonparametric test for multiple comparisons as described in Garcia et al. [55]. Using the Statistical Comparison of Multiple Algorithms in Multiple Problems (scmamp) library in R [56], we conducted Friedman’s aligned-rank test [57]. The Friedman test was chosen over other statistical tests such as ANOVA because it does not require the assumption of data normality. The Bergmann-Hommel post hoc test was then carried out for pairwise comparisons between SMN and the other three methods (RF, RUS and SMO) [54].
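For orientation only, SciPy's standard Friedman test can be applied to the per-assay scores of the four methods as sketched below; this approximates the analysis in this study, since the aligned-rank variant and the Bergmann-Hommel post hoc procedure come from the R package scmamp rather than SciPy.

```python
from scipy.stats import friedmanchisquare

# scores_rf, scores_rus, scores_smo, scores_smn: one metric value per assay (length 12 each)
stat, p_value = friedmanchisquare(scores_rf, scores_rus, scores_smo, scores_smn)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
```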

Table 1 Class distribution and imbalance ratio (IR) of the preprocessed training and test chemical datasets from Tox21 Data Challenge

Results and discussion

In this section, we present (1) a summary of the curated and preprocessed Tox21 dataset, (2) the preliminary comparative results to justify the selection of RF as the base classifier, (3) parameter optimization for RF and ENN algorithms, (4) performance metrics of four classification methods for the twelve imbalanced Tox21 datasets, (5) the impact of IR and classification methods on prediction performance, and (6) a comparison between this study and published Tox21 studies.

Data curation and preprocessing

A summary of the preprocessed training and test datasets of chemicals and their activities measured by the 12 qHTS in vitro assays is presented in Table 1. Although the original raw Tox21 datasets contained more than 12 K chemicals, approximately 50% of them or fewer were retained for each assay after preprocessing, primarily because of duplication and the absence of testing data for individual assays. The imbalance ratio (IR), defined as the ratio of the number of majority class samples (inactive compounds) to that of minority class samples (active compounds) [42], varied widely between assays and between the training and test sets. Such large disparities offered a great opportunity to investigate the performance of different ensemble-resampling approaches as a function of IR (see below for detailed results). In the training datasets, the highest IR of 41.7 appeared in the NR-PPAR-γ assay dataset, whereas the lowest IR of 5.7 was observed for the SR-MMP assay. Except for NR-Aromatase, NR-PPAR-γ and SR-ATAD5, the test datasets had IRs larger than or comparable to those of their corresponding training datasets, reaching as high as ~ 70 for NR-AR-LBD.

Selecting RF as the base classifier

A comparison of six popular machine learning algorithms, i.e., RF, K-nearest neighbors (KNN), decision trees (CART), Naïve Bayes (NB), support vector machine (SVM) and multilayer perceptron (MLP), was performed using the training datasets of all twelve assays and stratified fivefold cross-validation. These algorithms were all implemented in Scikit-learn [51] with default parameter settings. The purpose of this preliminary study was to select a base classifier from these algorithms, with the F1 score used as the evaluation metric. As shown in Fig. 2, RF was the frontrunner for four of the 12 assay datasets (NR-AR-LBD, SR-ARE, SR-HSE, and SR-MMP) and the second-best performer for another five (NR-AR, NR-ER, NR-ER-LBD, NR-PPAR-γ, and SR-p53). The average F1 score of RF across all 12 assays was the highest (0.2783) among the six algorithms, with MLP the runner-up at 0.2487. RF thus clearly outperformed the other five algorithms on the Tox21 dataset, which supported our choice of RF as the base classifier and allowed us to focus this study on imbalance handling methods.
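The spot check can be reproduced in outline as follows, with default-parameter classifiers compared by F1 under stratified fivefold cross-validation on one assay's training set; GaussianNB and SVC are assumed here as the Naïve Bayes and SVM implementations, since the exact variants are not specified above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "SVM": SVC(),
    "MLP": MLPClassifier(),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.4f}")
```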

Fig. 2

A spot check of six popular machine learning algorithms: performance of classifiers trained using the preprocessed Tox21 training datasets as evaluated using F1 score

Furthermore, the RF classifier was widely used by the participating teams in the Tox21 Data Challenge [28, 48]. Two of the winning teams developed RF models that achieved the best performance in predicting compound activities against AR, aromatase, and p53 [58] as well as ER-LBD [59]. Using the same RF classifier and the same dataset made it convenient to compare our results with those of the participating teams and allowed us to better investigate the impact of resampling methods on improving imbalanced learning and, consequently, classification performance (see the “Comparison with Tox21 Data Challenge winners” section below for details).

Parameter optimization for the RF classifier

It is generally accepted that the accuracy of a classifier ensemble is positively correlated with ensemble diversity [60]. Here, we adjusted the ensemble diversity by randomly selecting data instances to create the bootstrap samples (see Fig. 1) and by varying the number of classifiers included in the ensemble. Figure 3 shows how the performance of the classifier ensembles, measured by the average F1 score, AUPRC, AUROC and MCC for all four methods, changed with the number of classifiers in the ensemble. Performance plateaued once the number of classifiers reached 30, which appeared to be the optimal number in this setting; beyond this point, there was little improvement as the number of classifiers increased. Although minor improvements were noticed for some metrics (e.g., MCC) with 100 classifiers, using that many classifiers dramatically increased the computational time and resources needed to train the model. The relationship between performance and the number of classifiers may be explained by the importance of diversity in ensemble learning. With every bootstrap sample differing from the others in chemical composition and fingerprint features, diversity in the bagging ensemble was inherent. However, as the number of classifiers increased, the frequency with which a sample was drawn from the same population also increased, reducing the variance between bootstrap samples and leveling off the ensemble diversity. Consequently, the performance metrics flattened as the number of classifiers in the ensemble increased from 30 to 100 (Fig. 3). In the subsequent experiments, we adopted the optimal number of 30 classifiers for ensemble learning.

Fig. 3

The relationship between model performance and the number of RF base classifiers in the ensemble

Optimal number of nearest neighbors (k) in the ENN algorithm of SMN models

Another parameter we optimized was the k value in the ENN algorithm, which determines, by the vote of a synthetic instance’s k nearest neighbors, whether that instance is removed from the training set. As shown in Fig. 4, we varied the number of nearest neighbors k from 1 to 5, and 3 appeared to be the optimal value for most of the five measured performance metrics: F1 score and AUPRC peaked at k = 3, BA plateaued at k = 3 or 4, whereas MCC peaked earlier at k = 2. AUROC was the only metric unaffected by the change in k. Thus, the k value was set to 3 for SMN in this study.
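A simplified sketch of this k sweep is shown below; it trains a single random forest per k rather than the full ensemble used in the study and tracks only the F1 score, with X_train, y_train, X_test and y_test assumed.

```python
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

for k in range(1, 6):
    smn = SMOTEENN(enn=EditedNearestNeighbours(n_neighbors=k), random_state=42)
    X_res, y_res = smn.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    print(f"k = {k}: F1 = {f1_score(y_test, clf.predict(X_test)):.4f}")
```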

Fig. 4

Performance metrics of SMN models measured as the number of nearest neighbors (k) varied in the ENN

By setting k at this optimal value, ENN may help increase the classifier’s generalizability by removing noisy (mislabeled) synthetic instances introduced in the SMOTE step. By reducing the amount of noise in the dataset while reducing imbalance, it is expected that the class boundaries between active and inactive compounds can be better defined. A reduction in noisy instances can also reduce the chance of over-fitting. This is essentially where the power of SMN lies. However, further increments in the k value beyond the optimum led to a decline in classifier performance.

Performance evaluation metrics

Table 2 reports nine performance metrics and their average values for the four classification methods (RF, RUS, SMO and SMN) across the 12 bioactivity assays, with the best performer highlighted in bold for each evaluation metric and assay. The derived specificity results are reported along with the sensitivity and SSG results in Additional file 1: Table S1. For each assay, the training dataset was used to train a classifier with each of the four methods, and the trained classifier was then applied to the test dataset to determine the performance metrics as described in the “Materials and methods” section (also see Fig. 1). The reported values varied greatly depending on the metric, assay and algorithm. For instance, AUROC had the highest values, averaging 0.8049, whereas MCC had the lowest mean value of 0.2945. This is not surprising, as different metrics measure different aspects of learning algorithm performance and trained model quality [61].

Table 2 Nine metrics for evaluating the performance of four classification methods (RF, RUS, SMO and SMN) with twelve Tox21 qHTS assay datasets

We excluded accuracy (the ratio of correct predictions to the total number of chemicals) and specificity from the metrics panel presented in Table 2 because accuracy can be misleading when evaluating model performance for highly imbalanced classification [22]: a high accuracy does not translate into a high capability of the prediction model to correctly predict the rare class. Specificity is less relevant because we are more interested in the positive class (the active minority). However, the nine chosen metrics in the panel are not necessarily ideal for evaluating classification performance with a skewed class distribution. For instance, both AUROC and AUPRC provide a model-wide evaluation of binary classifiers [27]. Although AUROC, proposed as an alternative to accuracy [22], is unaffected by data skewness [62], it may give an excessively optimistic view of an algorithm’s performance on highly imbalanced data [21]. AUPRC, on the other hand, is affected by data imbalance [62], but it is a more informative and more realistic measure than AUROC for imbalanced classification [27]. Another example is precision and recall, both of which depend on the threshold selected to determine whether a chemical compound is active or inactive. A higher recall may be obtained by lowering the threshold (increasing the number of TP predictions and decreasing the number of FN predictions), which results in lower precision (more FP predictions); conversely, raising the threshold for labeling active chemicals may benefit precision but hurt recall. Optimizing both precision and recall therefore involves a tradeoff, especially with imbalanced data. The F1 score offers a balanced trade-off between precision and recall; nevertheless, like AUPRC, it is also attenuated by data skewness [62]. SSG, a good indicator of the balance between sensitivity and specificity [13], may become uninformative when both sensitivity and specificity are low. For applications such as predictive toxicology and drug discovery, one may be more interested in improving sensitivity than in reducing SSG because of the rarity of positive instances. Given the pros and cons of these metrics, it is necessary to use a suite of metrics for performance evaluation. Hence, we calculated the “average” of the nine metrics (Table 2), which may serve as a comprehensive indicator of model performance, although its formula (e.g., membership composition, weight of each component metric, and normalization method) and applicability still require further investigation.

Impact of imbalance ratio on performance metrics

The variation in the same performance metric across the different assay datasets is as high as 87% (coefficient of variation, CV; Table 2), suggesting that dataset properties (IR in particular) have a significant impact. Nevertheless, systematically assessing the impact of IR on prediction accuracy remains challenging. The IRs in our assay datasets varied from about 5 to 70 (Table 1). We calculated correlation coefficients (CCs) between log2(IR) and the scores of the evaluation metrics (Table 3). Except for the CCs between AUROC and RF/RUS/SMO, there was a significant negative correlation between the IR of the test datasets and the performance metrics F1 score, MCC, BA, AUPRC, AUROC, and the average of all nine chosen metrics. This is consistent with earlier reports on the adverse effects of IR on these metrics [62]. The statistically significant positive correlation between IR and SSG indicates that higher IRs increase SSG, which is also undesirable.
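The correlation analysis can be sketched in a few lines; ir_test and f1_scores are hypothetical arrays of the twelve test-set imbalance ratios and one method's F1 scores, and a Pearson correlation is assumed since the type of correlation coefficient is not stated above.

```python
import numpy as np
from scipy.stats import pearsonr

log2_ir = np.log2(ir_test)             # log2 of the imbalance ratios of the 12 test sets
cc, p = pearsonr(log2_ir, f1_scores)   # e.g., correlate against one classifier's F1 scores
print(f"CC = {cc:.2f}, p = {p:.3f}")
```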

Table 3 Correlation coefficients (CCs) between log2IR and six performance metrics plus the average of nine metrics in Table 2 for all four classification algorithms

To investigate how IR affects the extent of performance improvement obtained by different resampling techniques, the scores of four metrics (F1 score, MCC, SSG and the average of the nine metrics) for all twelve assays are plotted against log2(IR) in Fig. 5. For MCC, F1 score and the average of the nine metrics, the trend line of SMN lies well above those of SMO, RUS and RF, indicating that SMN performed better than the other classifiers. The trend lines of SMO and RUS intertwine with that of RF, suggesting that neither SMO nor RUS consistently improved the performance metrics over the base classifier RF. In addition, the SMN trend line intersects the other three at about log2(IR) = 4.8 (for the average), 5.5 (for MCC) or 6.1 (for F1 score), suggesting that a metric-specific IR between roughly 28 and 70 is the threshold below which SMN can outperform the other classifiers. The lower the IR, the greater the improvement SMN achieved compared with the RF, RUS and SMO classifiers; as the IR approached the threshold, the improvements became insignificant. These results demonstrate the limitation of data rebalancing techniques and also provide useful feedback for data acquisition. When evaluated by the SSG metric (the smaller, the better), RUS outperformed SMN and the other two algorithms, suggesting that SMN had limited power in narrowing the gap between sensitivity and specificity. Whenever possible, the number of active compounds should be increased to reduce the imbalance ratio and obtain more accurate predictions in SAR-based chemical classification.

Fig. 5

The relationship between imbalance ratio (Log2IR) and prediction performance metrics calculated for four classification methods (SMN, SMO, RUS and RF): a F1 score, b MCC, c SSG, and d the average of 9 metrics

Impact of resampling techniques on classifier performance

The effect of algorithm choice is partially reflected by the 0.1263 increase in the average metric score from RF (0.1854) to SMN (0.3116) (Table 2). We also calculated the average Friedman ranking of each classifier [55] by ranking the four algorithms from 1 (best) to 4 (worst) on each assay dataset; the algorithm with the lowest average rank is considered the best for a given metric. As shown in Fig. 6, SMN outperformed the other algorithms (RF, RUS and SMO) in terms of four metrics (F1 score, AUPRC, AUROC and MCC) and was only slightly surpassed by the frontrunner RUS for BA. Taking the F1 score as an example, SMN performed best on seven of the 12 assay datasets, followed by RUS, which was the best performer for three assays (Table 2). More interestingly, the magnitude of improvement offered by SMN over the next best method ranged from approximately 8% for the NR-ER-LBD dataset to as much as 27% for the SR-ARE and NR-Aromatase datasets. Understandably, the baseline classifier RF had the worst average performance even though its parameters were also optimized. SMN achieved a better F1 score in most cases because of its ability to improve recall without excessively lowering precision; a moderately higher recall with comparable precision positively impacts the F1 score.
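The per-assay ranking behind Fig. 6 can be sketched with pandas as follows, where scores is a hypothetical DataFrame with one row per assay and one column per method for a metric where higher is better; the aligned-rank statistics themselves were computed with scmamp.

```python
import pandas as pd

# scores: 12 rows (assays) x 4 columns (RF, RUS, SMO, SMN) for one metric
ranks = scores.rank(axis=1, ascending=False, method="average")   # rank 1 = best per assay
print(ranks.mean(axis=0).sort_values())                          # lowest average rank = best method
```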

Fig. 6

Average Friedman ranks of the four classification methods (RF, RUS, SMO and SMN) with respect to five metrics (F1 score, AUPRC, AUROC, MCC and BA). Error bars represent standard errors. See Table 4 for statistical significance in the difference between classifiers

Friedman’s aligned rank test for multiple comparisons [55] was performed to further examine the statistical significance of the algorithmic effects of the resampling techniques. Our null hypothesis was that all four algorithms had similar classification capability as measured by the nine metrics across the 12 datasets. The results in Table 4 suggest that all metrics except AUPRC were significantly affected by the resampling algorithm (p < 0.05). The Bergmann-Hommel post hoc analysis was applied to compare the performance metrics of SMN pairwise against the other three classifiers. SMN differed more from RF than from SMO and RUS, because one, two, and five metrics were not significantly different (p > 0.05) between SMN and RF, SMN and SMO, and SMN and RUS, respectively. F1 score, MCC and Brier score showed significant differences among the four classifiers in both the multiple and pairwise comparisons. For instance, SMN had the lowest average Brier score of 0.3312 ± 0.0509 (average ± standard error) compared with SMO (0.4109 ± 0.0627), RUS (0.3894 ± 0.0361), and the baseline classifier RF (0.3967 ± 0.0395); a lower Brier score indicates that a classifier’s predicted probabilities are closer to the ground truth. MCC, a metric widely used to evaluate the performance of SAR-based chemical classification [63, 64], incorporates all components of the confusion matrix and hence provides a reliable summary of the performance of models trained on imbalanced data.

Table 4 Friedman’s aligned rank test and Bergmann-Hommel post hoc analysis results showing corrected p-values for multiple and pair-wise comparisons between SMN and the other three classifiers, respectively

In contrast, AUPRC was the sole metric that did not differ significantly in any of the comparisons. AUPRC computes the area under the precision-recall curve, obtained from the precision values at different recall levels, to assess the overall performance of a prediction model [51]. SMN showed improved AUPRC scores compared with the other algorithms, but the improvement was not substantial. Unlike the F1 score, which is computed at a single classification threshold, AUPRC is threshold-independent: it computes the entire area under the precision-versus-recall curve over all possible thresholds, so minor improvements in the predicted probabilities for each class do not translate into a marked improvement in the AUPRC score. Nevertheless, SMN still showed the best AUPRC in 33% (4/12) of the cases tested, RF and SMO in 25% (3/12) each, and RUS in 16% (2/12).

The above results suggest that AUPRC is not sensitive to algorithmic effects, whereas F1 score, MCC and Brier score are sensitive metrics that can distinguish the classifiers by their performance. These results also indicate that SMN was the best performer, followed by RUS, while SMO and RF performed worst on the Tox21 datasets. In terms of the average of all nine metrics (Table 2), SMN and RUS ranked best for six and five assays, respectively, whereas RF had the best performance only for the NR-AR assay and SMO underperformed across all 12 assays. These results led us to speculate that the activity landscape of the majority class (inactive compounds) may be more continuous and smooth than that of the minority class (active compounds) [65]. Consequently, removing some instances from the majority class would not affect class boundaries. On the contrary, adding synthetic instances to the minority class (SMOTE) may introduce noise along the borderlines, leading to the loss of activity cliffs and the mislabeling of synthetic instances [66]. The ENN algorithm may effectively remove those synthetic outliers and restore the activity cliffs and class boundaries, leading to the enhanced prediction performance of SMOTEENN (SMN) [67].

Comparison with Tox21 Data Challenge winners

In this section, we compare the prediction performance of the four classifiers in this study with that of the models developed by the winning teams for each of the assays in the Tox21 Data Challenge [34]. The winning team for each sub-challenge was judged by AUROC (and BA if there was a tie in AUROC [35]). The AUROC and BA scores of the top ten ranked teams are posted at https://tripod.nih.gov/tox21/challenge/leaderboard.jsp. The 12 assay sub-challenges were won by four teams: Bioinf@JKU, Amaziz, Dmlab and Microsomes. Bioinf@JKU developed DeepTox models using deep learning [25] and won six of the 12 assay sub-challenges (NR-AhR, NR-AR-LBD, NR-ER, NR-PPAR-γ, SR-ARE, and SR-HSE), in addition to the Grand Challenge and the two panel sub-challenges for the Nuclear Receptor Panel and the Stress Response Panel. Amaziz [68] employed associative neural networks to develop the winning models for the SR-ATAD5 and SR-MMP assays and had the best overall BA score. Dmlab [58] used multi-tree ensemble methods, such as Random Forests and Extra Trees, to produce the winning models for three assays (NR-AR, NR-Aromatase and SR-p53). Microsomes [59] chose Random Forest for descriptor selection and model generation and produced the best performing NR-ER-LBD model. For the purpose of comparison, we selected Dmlab and Microsomes because they used Random Forest; we also compared our best classifier with the winner of each assay sub-challenge. Given the over-optimistic nature of AUROC, the BA metric provides a more realistic and reliable measure for performance comparison. The best BA scores were shared by five teams: Kibutz (1 assay), Bioinf@JKU (2), Amaziz (2), T (3), and StructuralBioinformatics@Charite (4). The AUROC and BA scores of the winning teams are shown in Table 5 side by side with those of our best performing classifiers, as these are the only metrics available for the Tox21 Data Challenge.

Table 5 Comparison between this study and Tox21 Data Challenge winners in terms of the classification performance metrics AUROC and balanced accuracy

Although the AUROC and BA metrics are not ideal for evaluating imbalanced classification, we made this comparison to demonstrate that the improvement obtained from imbalance pre-processing enabled our classifiers to match or outperform the winning models of the Tox21 Data Challenge. This is reflected by the following observations: (1) our best classifiers outperformed Dmlab and Microsomes in terms of both AUROC and BA by large margins, with only four exceptions (NR-AR, NR-PPAR-γ, SR-ATAD5 and SR-MMP) where Dmlab exceeded our best classifiers in AUROC by less than 4%; (2) our best classifiers had the same or higher AUROC and a higher BA than the challenge winners for six and three assays, respectively, with less than an 8% (AUROC) or 17% (BA) difference for the remaining assays; and (3) on average, our best classifiers performed almost as well as the challenge winners as a whole (Table 5). The last two columns in Table 5 report the ratios of our best classifier’s BA and AUROC to those of the Tox21 Challenge winner, with a value greater than 1 indicating that our model performed better than the winning model. These results (particularly the BA scores) not only support the validity and scientific soundness of the approach, methodology and algorithms implemented in this study, but also demonstrate that our models reached performance levels comparable to those of the Tox21 Data Challenge winners.

It is also worth noting that Banerjee et al. [13] performed similar work on three Tox21 datasets (AhR, ER-LBD, and HSE). They employed RF as the base classifier (without ensemble learning) and applied eight different undersampling or oversampling techniques (including random undersampling and SMOTE). Similar to this study, their work also demonstrated that the dataset and the resampling technique had significant impacts on the classification outcome and that such impacts varied from one metric to another, with sensitivity and F-measure being more sensitive than AUROC and accuracy.

Another study worth mentioning is that of Norinder and Boyer [16], who achieved balanced prediction performance, with sensitivity and specificity on the external test dataset both reaching 0.70–0.75, by applying MCP to similar ToxCast and Tox21 estrogen receptor assay datasets with SVM as the classifier. These results are far superior to those obtained using SVM or RF alone without resampling or MCP [16, 69], but they are only slightly better than the performance of RUS in our study, with sensitivity of 0.69 or 0.55 (Table 2) and specificity of 0.61 or 0.84 (Additional file 1: Table S1). Therefore, further in-depth investigations are warranted to compare resampling, MCP and MCP + resampling side by side using the same machine learning algorithms, the same raw datasets, and the same preprocessing procedure.

Conclusions

Due to the specificity of toxicant-target biomolecule interactions, SAR-based chemical classification studies are often impeded by the imbalanced nature of many toxicity datasets. Furthermore, class boundaries are often blurred since active toxicants often appear in the minority class. In order to address these issues, common resampling techniques can be applied. However, removing majority class instances using an undersampling technique can result in information loss, whereas increasing minority instances by interpolation tends to further obfuscate the majority class space, giving rise to over-fitting. In order to improve the prediction accuracy attained from imbalanced learning, SMOTEENN, a combination of SMOTE and ENN algorithms, is often employed to oversample the minority class by creating synthetic samples, followed by cleaning the mislabeled instances. Here, we integrated an ensemble approach (bagging) with a base classifier (RF) and various resampling techniques to form four learning algorithms (RF, RUS, SMO and SMN). Then, we applied them to the binary classification of 12 highly imbalanced Tox21 in vitro qHTS bioassay datasets.

We generated multiple sets of chemical descriptors or fingerprints and down-selected small groups of features for use in class prediction model generation. After data preprocessing, parameters were optimized for both resampling and classifier training. The performance of the four learning methods was compared using nine evaluation metrics, among which F1 score, MCC and Brier score provided more consistent assessment of the overall performance across the 12 datasets. The Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. It was also found that there was a strong negative correlation between prediction accuracy and IR. We observed that SMN became less effective when IR exceeded a certain threshold (e.g., > 28). Therefore, SAR-based imbalanced learning can be affected by the degree of dataset skewness, resampling algorithms, and evaluation metrics. We recommend assembling a panel of representative, diversified and imbalance-sensitive metrics, developing a comprehensive index from this panel, and using the index to evaluate the performance of classifiers for imbalanced datasets.

The ability to separate the small number of active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through imbalance handling. Although the best classifiers of this study achieved the same level of performance as the winners of the Tox21 Data Challenge as a whole, we believe that there is still plenty of room for further improvement. Given the exceptionally outstanding performance of DeepTox [25] and our own experience with deep learning-based chemical toxicity classification [70], our future plan is to replace RF with a deep learning algorithm like deep neural networks as the base classifier and combine it with class rebalancing techniques to build novel deep learning models for SAR-based chemical toxicity prediction. We are also interested in pursuing a novel approach by integrating MCP, resampling and ensemble strategies to further improve the robustness and performance of imbalanced learning.