Abstract
Machine intelligence (MI), including machine learning and deep learning, has been regarded as a promising means to reduce the prohibitively high cost of drug development. However, a dilemma within MI has limited its wide application: machine learning models are easier to interpret but yield worse predictive performance than deep learning models. Therefore, we propose a pipeline called Class Imbalance Learning with Bayesian Optimization (CILBO) to improve the performance of machine learning models in drug discovery. To demonstrate the efficacy of the CILBO pipeline, we developed an example model to predict antibacterial candidates. A comparison of antibacterial prediction performance between our model and a well-known deep learning model published by Stokes et al. suggests that our model can perform as well as the deep learning model in drug activity prediction. The CILBO pipeline provides a simple alternative approach to accelerate preliminary screening and decrease the cost of drug discovery.
Introduction
Drug development is a costly process that takes more than 10 years and an average of 2.6 billion USD to bring a drug from preliminary discovery to the market1,2,3,4,5. Much of this cost is attributed to the high attrition rate of candidate drugs in failed trials3,4. Despite substantial investment in the selection of candidate molecules from hundreds or thousands of compounds, the success rate for new candidates to finally reach the market is approximately 10%4. Therefore, selecting the most promising candidate molecules helps to accelerate the research process and reduce final-stage failure, thereby minimizing the cost of drug development.
Machine intelligence (MI), including machine learning and deep learning, has been successfully applied to drug discovery and is regarded as a promising method for such candidate selection6,7. However, there is a dilemma between performance and interpretability within MI8, which has limited its application. Previous studies have shown that deep learning models perform better than machine learning models on classification tasks9,10 but are harder to interpret11. The reason deep learning models are more difficult to interpret is that it is harder to find direct and reliable correlations between the features the models use for classification and the output predictions11. This limitation is unfavourable in drug development because researchers in the field prefer to acquire not only the capacity for prediction but also the knowledge suggested by the model12.
Developing methods to interpret the outcomes of deep learning models is no simple task12,13,14. Instead, improving the performance of machine learning models may offer a faster and easier solution to alleviate the dilemma of predictivity and interpretability. For instance, automatic machine learning (AutoML) is one promising strategy to enhance the drug development process.
Models built on automatic machine learning have been verified in healthcare to deliver relatively good performance15,16. Even a deep learning model's performance sometimes relies on a network structure suggested by automatic machine learning10. Under an AutoML algorithm, various hyperparameters or ensemble methods can be tried automatically tens to hundreds of times to improve the performance of a machine learning model. Meanwhile, the AutoML algorithm frees the researcher from this tedious and repetitive trial and error17. Moreover, training a machine learning model generally takes orders of magnitude less time than training a deep learning model, so machine learning models built on AutoML can easily be trained with many more hyperparameter combinations in an acceptable training time.
Additionally, drug discovery datasets are usually highly imbalanced: they contain very few functional candidates (the class of interest) and hundreds or thousands of times more non-functional molecules (the uninteresting class)18,19. The minority class of interest is more likely to be predicted as a rare occurrence, ignored altogether, or treated as noise or outliers, which causes bias and leads to poor generalization performance20. Although several previous studies in disease- and drug-related fields proved that addressing the class imbalance problem appropriately improves a model's performance21,22,23, this problem has still been frequently ignored. When machine learning models apply strategies to minimize the classification bias caused by imbalanced datasets, the performance gains obtained through the AutoML algorithm may be amplified further.
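As a minimal illustration (assuming scikit-learn, and label counts mirroring the 120 actives out of 2335 molecules in the training set described later), the "balanced" class-weighting strategy rescales each class inversely to its frequency, so the rare actives are not drowned out:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels mirroring the training set used later in the paper:
# 2335 molecules, of which 120 are active (class 1).
y = np.array([1] * 120 + [0] * 2215)

# "balanced" reweights each class by n_samples / (n_classes * count),
# so the 120 actives carry roughly 18x the weight of the 2215 inactives.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
```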
Therefore, we suggest a pipeline for constructing a machine learning model using Bayesian optimization together with strategies for imbalanced datasets to improve the model's classification performance in drug discovery. In this pipeline, Bayesian optimization suggests the best combination of hyperparameters governing the model itself, its training, and the treatment of imbalanced datasets. Bayesian optimization is a sequential design strategy for global optimization of black-box functions that does not assume any functional form, and it seems particularly well positioned for application areas such as drug development24,25. The optimization used here differs from commonly used hyperparameter optimization by also addressing the issue of class imbalance. We call this pipeline Class Imbalance Learning with Bayesian Optimization (CILBO). Antibiotic prediction is used here as an example to evaluate the effectiveness of the pipeline.
In the past decades, antibiotic screening has become prohibitively costly, and the rate of discovery has decreased dramatically10,26,27,28,29. This situation, together with widespread antibiotic resistance, makes the discovery of new antibiotics critical10,26,30. Recent work by Stokes et al. showed the successful discovery of new antibiotics through the combination of a graph neural network (GNN) model prediction and empirical investigation10. Stokes' research highlighted the significant impact of deep learning on early antibiotic discovery, but interpreting the results generated by their model is still not an easy task.
Here, using the same datasets described in the paper by Stokes et al.10, we generated a random forest classifier with CILBO and compared the predictions of our model to those of Stokes’ model10. The comparison suggests that our machine learning model can perform predictions as well as the deep learning model. The CILBO pipeline is simple to run and able to efficiently improve the performance of machine learning models, offering an alternative approach that can be widely applied in many fields of drug discovery.
Results
Summary of our example model construction
To improve the classification performance of the easy-to-interpret machine learning models on drug discovery, we built a pipeline called Class Imbalance Learning with Bayesian Optimization (CILBO). This pipeline uses Bayesian optimization to suggest the best combination of hyperparameters for model variables, training, and treating imbalanced datasets of a machine learning model. We constructed a random forest classifier based on CILBO as an example to verify CILBO’s effectiveness on improving model performance in drug discovery (Fig. 1).
Evaluation of our model performance during the training phase
We trained our model on the same datasets used in Stokes' GNN model10. The dataset used for training contained 2335 molecules, only 120 of which had proven antibacterial activity (see Supplementary Table S1); the dataset was therefore highly imbalanced. To compare the performance and outcomes of our model with Stokes' deep learning model, we used the same training dataset and training/testing ratio (see "Methods" section).
The best hyperparameters suggested by Bayesian optimization, which were used in this study, are listed in Table 1. The last two hyperparameters, "class_weight" and "sampling_strategy", were utilized to minimize the bias caused by dataset imbalance and enhance overall model performance. The "fingerprint" feature, which describes topological structure representations, was very useful for interpreting the model. The RDK fingerprint computed by RDKit31 was selected as the molecule feature for our model because it performed best among the optional features (for descriptors and other fingerprints, see "Methods" section).
With these best hyperparameters and this feature, the average receiver operating characteristic area under the curve (ROC-AUC) of our model, after 30 repetitions of fivefold cross-validation during the training phase, was approximately 0.917 (Supplementary File random_forest_antibiotic_20210511c.py). This exceeded 0.896, the ROC-AUC of Stokes' model10. After being enhanced with the best hyperparameters and molecule features described above, and with more training samples (the training set included 90% of the molecules), our final model achieved a ROC-AUC of 0.99 (Fig. 2). The confusion matrix based on the test set of our final model is shown in Table 2. According to this matrix, our model did not classify any non-antibacterial molecules as antibacterial, which suggests that the model has a low false-positive rate for identifying candidate compounds.
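The repeated cross-validation scheme above can be sketched as follows. The data here are a synthetic stand-in generated with scikit-learn, not the paper's dataset, and only 3 of the 30 repetitions are run to keep the sketch fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced binary problem (~5% positives) as a stand-in
# for the real fingerprint matrix and activity labels.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)

# The paper averages ROC-AUC over 30 repetitions of fivefold CV;
# 3 repetitions are used here for speed.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=cv)
print(f"mean ROC-AUC over {len(scores)} folds: {scores.mean():.3f}")
```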
Comparison of the prediction results generated by our model and Stokes’ model in antibacterial discovery
Our final model was then applied to identify candidate molecules with antibacterial characteristics from the library recorded by the Drug Repurposing Hub10,32. This library contained 6111 molecules at different stages of investigation for human diseases. After removing compounds with molecular graphs common between the training dataset (2335 molecules) and the Drug Repurposing Hub (6111 molecules), a dataset containing 4496 unique molecules was used for the identification. Importantly, this dataset was also the same set of molecules used by Stokes et al. to demonstrate their model10. For each molecule, our model provides a prediction score. The full list of prediction results can be found in Supplementary Table S2.
Using their model, Stokes and his colleagues selected the top 99 molecules most strongly predicted to display antibacterial properties, along with the 63 molecules with the lowest prediction scores. To validate this finding, Stokes et al. empirically tested these 162 (99 + 63) molecules10. We leveraged Stokes' empirically tested antibacterial information to evaluate our own predictions across the same set of 162 molecules (Supplementary Table S3). We found that our model was as effective as Stokes' model in predicting antibacterial properties for these 162 molecules (Fig. 3).
Specifically, about 75% of the molecules with prediction scores above 0.5 in our model (the model's default threshold score) were found to be empirically tested antibacterials, while 74% of the molecules with prediction scores above 0.5 in Stokes' model were empirically tested antibacterials. When a threshold score above 0.5 was required in both models simultaneously, about 80% of the molecules meeting the criterion had empirically tested antibacterial properties, which was higher than the proportion achieved by either model alone. If 90% of the empirically tested antibacterial molecules were required to be captured, the score thresholds of our model and Stokes' model would be 0.15 and 0.3, respectively, based on this comparison (Fig. 3). Of the molecules that scored near zero, almost all lacked antibacterial properties in both models. This finding further emphasizes that our model is comparable to Stokes' model in its prediction capacity.
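The threshold selection just described can be sketched as follows, using hypothetical prediction scores in place of the actual model outputs (the label counts match the 53 confirmed actives among the 162 tested molecules, but the scores themselves are simulated):

```python
import numpy as np

# Hypothetical scores for the 162 tested molecules: 53 actives, 109 inactives.
rng = np.random.default_rng(0)
labels = np.array([1] * 53 + [0] * 109)
scores = np.where(labels == 1,
                  rng.beta(5, 2, size=162),   # actives skew toward high scores
                  rng.beta(2, 5, size=162))   # inactives skew toward low scores

# Lowest threshold that still captures at least 90% of the confirmed
# actives, mirroring the comparison in the text.
active_scores = np.sort(scores[labels == 1])
n_keep = int(np.ceil(0.9 * active_scores.size))
threshold = active_scores[active_scores.size - n_keep]
recall = (scores[labels == 1] >= threshold).mean()
print(f"threshold={threshold:.2f}, recall of actives={recall:.2f}")
```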
Discussion
Machine intelligence (MI) has been regarded as a promising approach to help relieve the pressure of prohibitively costly procedures during drug discovery. However, the dilemma between predictivity and interpretability within MI has limited broader applications in drug discovery. Therefore, we proposed a pipeline, Class Imbalance Learning with Bayesian Optimization (CILBO), to improve the classification performance of machine learning models.
We verified the performance of an example model built based on CILBO for candidate antibiotic discovery. Here, a random forest classifier constructed using Bayesian optimization with strategies for imbalanced datasets was applied as the example model. The random forest classifier was adopted because it limits overfitting and is easy to interpret.
By using the same datasets and similar training conditions described in Stokes' paper10 (see "Methods" section), our model was comparable to Stokes' model. Based on the same splitting ratio of training and testing sets, the average ROC-AUC of our model reached 0.917, notably higher than the 0.896 ROC-AUC of Stokes' model. We also compared our model against an imbalanced random forest model with empirically chosen hyperparameters and against XGBoost; their average ROC-AUCs were 0.895 and 0.901, respectively (see the Supplementary File "machine_learning_model_20211208.py"). This finding indicates that the performance of our model is not inferior to that of Stokes' deep learning model10 based on the comparatively simple library we used for training.
When directly comparing the prediction results of two models among the 162 empirically tested molecules, our model also successfully captured promising candidates at a rate comparable to Stokes’ deep learning model. These results all suggest that our model can perform as well as deep learning models in early screenings for drug discovery.
A key natural benefit of a model built with CILBO is that it is easy to interpret. Furthermore, the time required to train a machine learning model like the random forest used here is at least 100 times shorter than the time required to train a deep learning model, which leaves enough time to try various hyperparameters automatically and identify the best ones. Another advantage of a model built on CILBO compared to a typical deep learning model is that it is less dependent on the model designer and the infrastructure: to design a neural network model, a researcher needs rich experience and high-end hardware, whereas the random forest model is simpler to design because the key choices are provided by Bayesian optimization, an AutoML method. The simplicity and efficiency of our model therefore make it applicable in a wider context of drug candidate selection.
Of particular note, imbalanced datasets are fairly common in the drug industry. They can cause severe classification bias during MI-aided drug development, yet this problem has been frequently ignored. We considered this imbalance problem in the construction of our model and used dedicated hyperparameters to control this type of bias, which was designed to further enhance the model's performance.
In our work, we constructed a special random forest model using CILBO, a pipeline we proposed, and compared this machine learning model to the deep learning model created by Stokes et al.10 in antibiotic discovery. The comparison results, together with the other properties of our model, suggest that: (1) the machine learning model built on CILBO can perform prediction at least as well as the deep learning model; (2) it is naturally easier to interpret and comparatively simpler to operate, without requiring a high level of researcher experience; and (3) the inclusion of strategies for class imbalance to control classification bias further improved the predictive performance of the model built on CILBO and may broaden its applicability in drug development. Therefore, CILBO, the pipeline we designed, provides a simple alternative solution to promote MI in drug development.
Methods
Model selection
We used a random forest model as the classifier in this work (Fig. 1). A random forest model is robust against overfitting33 and easy to interpret because routine methods can be used to estimate the importance of and interaction between features34.
Molecule feature selection
The optional molecule features included descriptors, the RDK fingerprint, the MACCS key, the Avalon fingerprint, ECFP4 and ECFP6, all calculated with RDKit 2020.09.1.031. Each of these features was tested separately during the training phase of our model; complex combinations of features were not tested. Descriptors usually provide information on different molecular properties, such as general physical properties, electrochemical properties and electron cloud characteristics. Fingerprints provide descriptions of various topological structure representations of molecules. The RDK fingerprint was chosen for the final model. The RDK fingerprints calculated for the training dataset (2335 unique compounds) and the candidate-prediction dataset (4496 unique compounds) can be found in Supplementary Tables S4 and S5, respectively.
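A minimal sketch of computing an RDK fingerprint with RDKit and converting it into a bit vector suitable for a classifier; aspirin's SMILES is used as a hypothetical stand-in molecule, not one drawn from the paper's datasets:

```python
import numpy as np
from rdkit import Chem, DataStructs

# Aspirin as a hypothetical example molecule.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Chem.RDKFingerprint returns a 2048-bit topological fingerprint by default.
fp = Chem.RDKFingerprint(mol)

# Convert the bit vector into a 0/1 numpy feature vector for the classifier.
bits = np.zeros((fp.GetNumBits(),), dtype=np.int8)
DataStructs.ConvertToNumpyArray(fp, bits)
print(bits.shape, int(bits.sum()))
```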
Hyperparameters optimization
Bayesian optimization was used to find the best hyperparameters for our model. In this work, Bayesian optimization was applied not just to the classifier but also to the strategies for handling imbalanced datasets. As noted, the training dataset was highly imbalanced, which may introduce classification bias. Bayesian optimization was expected to give the best combination of hyperparameters for the classifier and to mitigate the problem caused by class imbalance. This is distinct from common models using an automatic machine learning algorithm, where in most cases only classifier hyperparameters and training details are the focus, while strategies to deal with imbalanced datasets are frequently ignored.
Specifically, the Python packages scikit-learn 0.23.2, imbalanced-learn 0.7.0 and scikit-optimize 0.8.1 were used to search for the best hyperparameters. For convenience, the BalancedRandomForestClassifier defined in the imbalanced-learn 0.7.0 package was adopted, since it includes both a sampling strategy and class weights. In our model, the maximum feature number was the default value in BalancedRandomForestClassifier. The hyperparameters that needed to be optimized were as follows:
| Hyperparameter | Value type (range) |
| --- | --- |
| n_estimators | Integer (5, 5000) |
| criterion | Categorical (["gini", "entropy"]) |
| max_depth | Integer (1, 6000) |
| min_samples_split | Integer (2, 200) |
| min_samples_leaf | Integer (1, 200) |
| bootstrap | Categorical ([True, False]) |
| class_weight | Categorical (["balanced", "balanced_subsample", None]) |
| sampling_strategy | Categorical (["majority", "not minority", "not majority"]) |
The first six hyperparameters, "n_estimators", "criterion", "max_depth", "min_samples_split", "min_samples_leaf" and "bootstrap", are those most frequently tuned in random forest models. The last two, "class_weight" and "sampling_strategy", were specifically used to handle imbalanced datasets.
In the Bayesian search, ROC-AUC was the metric used for the random forest classifier. BayesSearchCV from the scikit-optimize package performed the Bayesian search, with the default surrogate model. The cross-validation folds were created by StratifiedKFold in the scikit-learn package. The splitting number was set to 5 and the iteration number to 150; these numbers were set according to the corresponding description in Stokes' paper10 for a better comparison. Once the best hyperparameters were identified, an additional cross-validation was performed. The final model was then trained with the best hyperparameters suggested by Bayesian optimization.
Datasets for training, analysis and direct comparison
The dataset used to train our model was the same dataset described in Stokes' paper10. This dataset combined the molecules from the US FDA-approved Drug Library and those isolated from natural products. It contained 2335 unique compounds, 120 of which had growth inhibitory activity against E. coli (see Supplementary Table S1); the dataset was highly imbalanced. For a fair comparison, the splitting ratio of training and testing sets used in our model was the same as in the Stokes model10: approximately 80%/20% (training/testing) for the cross-validation and approximately 90%/10% for the final model training.
The dataset we used for candidate prediction was also the same dataset described by Stokes10. It came from the Drug Repurposing Hub10,32 and contained 6111 molecules at various stages of investigation for human diseases. After removing the compounds with molecular graphs common between the training set and the Drug Repurposing Hub, the remaining 4496 molecules were used for prediction by both our model and Stokes' model (see Supplementary Table S2). Of these molecules, 99 received the highest prediction scores and 63 the lowest from the Stokes deep learning model (see Supplementary Table S3). These 162 (99 + 63) molecules were empirically tested for growth inhibition against E. coli by Stokes and his colleagues, and 53 of them (51 out of 99 and 2 out of 63) were found to have antibacterial properties.
The dataset used for direct comparison of prediction results between our model and Stokes’ model was based on the 162 molecules with empirically tested antibacterial information.
References
Mullard, A. New drugs cost US$2.6 billion to develop. Nat. Rev. Drug Discov. 13, 877 (2014).
Berdigaliyev, N. & Aljofan, M. An overview of drug discovery and development. Future Med. Chem. 12(10), 939–947 (2020).
DiMasi, J. A., Grabowski, H. G. & Hansen, R. W. The cost of drug development. N. Engl. J. Med. 372, 1972–1972 (2015).
Tamimi, N. A. M. & Ellis, P. Drug development: From concept to marketing!. Nephron. Clin. Pract. 113(3), 125–131 (2009).
Dickson, M. & Gagnon, J. P. The cost of new drug discovery and development. Discov. Med. 4, 172–179 (2009).
Zhang, L., Tan, J., Han, D. & Zhu, H. From machine learning to deep learning: Progress in machine intelligence for rational drug discovery. Drug Discov. Today 22, 1680–1685 (2017).
Carpenter, K. A., Cohen, D. S., Jarrell, J. T. & Huang, X. Deep learning and virtual drug screening. Future Med. Chem. 10, 2557–2567 (2018).
Wu, L. et al. Trade-off predictivity and explainability for machine-learning powered predictive toxicology: An in-depth investigation with Tox21 data sets. Chem. Res. Toxicol. 34, 541–549 (2021).
Korotcov, A., Tkachenko, V., Russo, D. P. & Ekins, S. Comparison of deep learning with multiple machine learning methods and metrics using diverse drug discovery datasets. Mol. Pharm. 14, 4462 (2017).
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702 (2020).
London, A. J. Artificial intelligence and black-box medical decisions: Accuracy versus explainability. Hastings Cent. Rep. 49, 15–21 (2019).
Jiménez-Luna, J., Grisoni, F. & Schneider, G. Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2, 573–584 (2020).
Choo, J. & Liu, S. Visual analytics for explainable deep learning. IEEE Comput. Graph. Appl. 38, 84–92 (2018).
Samek, W., Wiegand, T. & Müller, K.R. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. http://arXiv.org/1708.08296 (2017).
Waring, J., Lindvall, C. & Umeton, R. Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artif. Intell. Med. 104, 101822 (2020).
Delahanty, R. J., Kaufman, D. & Jones, S. S. Development and evaluation of an automated machine learning algorithm for in-hospital mortality risk adjustment among critical care patients. Crit. Care Med. 46, e481–e488 (2018).
Tuggener, L. et al. Automated machine learning in practice: State of the art and recent results. In The 6th Swiss Conference on Data Science, SDS 2019, 31–36. http://arXiv.org/1907.08392 (2019).
Yap, B. W. et al. An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. Lect. Notes Electr. Eng. 285, 13–22 (2014).
Ebenezer, A. B., Boyinbode, O. K. & Idowu, O. M. A comprehensive analysis of handling imbalanced dataset. Int. J. Adv. Trends Comput. Sci. Eng. 10, 454–463 (2021).
Zheng, Z., Cai, Y. & Li, Y. Oversampling method for imbalanced classification. Comput. Inform. 34, 1017–1037 (2015).
Anaissi, A., Kennedy, P. J., Goyal, M. & Catchpoole, D. R. A balanced iterative random forest for gene selection from microarray data. BMC Bioinform. 14, 261 (2013).
Khalilia, M., Chakraborty, S. & Popescu, M. Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inf. Decis. Making 11, 51 (2011).
Shi, H. et al. Predicting drug-target interactions using Lasso with random forest based on evolutionary information and chemical structure. Genomics 111, 1839–1852 (2019).
Mockus, J. Bayesian Approach to Global Optimization (Kluwer Academic Publishers, 1989).
Frazier, P. I. A tutorial on Bayesian optimization. http://arXiv.org/1807.02811 (2018).
Bérdy, J. Thoughts and facts about antibiotics: Where we are now and where we are heading. J. Antibiot. 65, 385–395 (2012).
Katz, L. & Baltz, R. H. Natural product discovery: Past, present, and future. J. Ind. Microbiol. Biotechnol. 43, 155–176 (2016).
Atanasov, A. G., Zotchev, S. B., Dirsch, V. M. & Supuran, C. T. Natural products in drug discovery: Advances and opportunities. Nat. Rev. Drug Discov. 20, 200–216 (2021).
Spellberg, B., Powers, J. H., Brass, E. P., Miller, L. G. & Edwards, J. E. Trends in antimicrobial drug development: Implications for the future. Clin. Infect. Dis. 38, 1279–1286 (2004).
Coates, A. R. M. & Hu, Y. Novel approaches to developing new antibiotics for bacterial infections. Br. J. Pharmacol. 152, 1147–1154 (2007).
Landrum, G. RDKit: Open-Source Cheminformatics (2006). http://www.rdkit.org/.
Corsello, S. M. et al. The drug repurposing hub: A next-generation drug library and information resource. Nat. Med. 23, 405–408 (2017).
Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2(3), 18–22 (2002).
Ishwaran, H. Variable importance in binary regression trees and forests. Electron. J. Stat. 1, 519–537 (2007).
Author information
Contributions
S.G. conceived of the presented idea. S.G. developed the method and performed all the computations and analyses. N.F. helped with result presenting and discussion. N.F. drafted the manuscript. S.G. and N.F. revised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Guan, S., Fu, N. Class imbalance learning with Bayesian optimization applied in drug discovery. Sci Rep 12, 2069 (2022). https://doi.org/10.1038/s41598-022-05717-7