Abstract
Machine learning algorithms give sub-optimal performance in the presence of class-imbalanced dataset. Mammalian target of rapamycin (mTOR) is one of the serine/threonine protein kinase, and plays an integral role in autophagy pathway. Autophagy is a cellular pathway for recycling of macromolecules (proteins, lipids, and organelles), which enables eukaryotic cells to adapt metabolism to survive during adverse growth conditions. Targeting mTOR through therapeutic interventions of autophagy pathway establishes mTOR a promising pharmacological target for autophagy modulation in cancer. The bioactivity dataset of mTOR in ChEMBL, a compound bioactivity database maintained by European Bioinformatics Institute, shows disproportionate distribution of active and inactive classes. The predictive models based on this skewed dataset are biased towards prediction of majority class. Hence, we have used Synthetic Minority Over-sampling TEchnique to deal with class-imbalance problem in bioactivity datasets. We have built and evaluated predictive models based on four commonly used classifiers using both class-imbalanced and class-balanced bioactivity datasets, and compared their performance based on various metrics like accuracy, sensitivity, specificity, F1-measure, and AUC. We observe that the classification models based on balanced dataset generally outperform those that are based on class-imbalanced dataset, irrespective of the classifiers used for classification task. We conclude that predictive models trained over class-balanced dataset can be used for screening large compound bioactivity datasets to predict mTOR inhibitors-like compounds.
Similar content being viewed by others
References
Bender A. Databases: compound bioactivities go public. Nat Chem Biol. 2010;6(5):309.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. Boca Raton: CRC Press; 1984.
Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011;2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
Chiarini F, Evangelisti C, McCubrey JA, Martelli AM. Current treatment strategies for inhibiting mtor in cancer. Trends Pharmacol Sci. 2015;36(2):124–35.
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
Fabbro D, Cowan-Jacob SW, Moebitz H. Ten things you should know about protein kinases: IUPHAR review 14. Br J Pharmacol. 2015;172(11):2675–700.
Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, et al. ChEmbl: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012;40(D1):D1100–7.
Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, et al. The ChEMBL database in 2017. Nucleic Acids Res. 2016;45(D1):D945–54.
Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl. 2017;73:220–39.
Haykin S. Neural networks: a comprehensive foundation. Englewood Cliffs: Pretice Hall International, Inc.; 1999.
He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2008;73(9):1263–84.
Kim YC, Guan KL. mTOR: a pharmacologic target for autophagy regulation. J Clin Investig. 2015;125(1):25–32.
Li Q, Wang Y, Bryant SH. A novel method for mining highly imbalanced high-throughput screening data in pubchem. Bioinformatics. 2009;25(24):3310–6.
Loh WY. Classification and regression trees. Wiley Interdiscip Rev Data Min Knowl Discov. 2011;1(1):14–23.
Roskoski R Jr. Classification of small molecule protein kinase inhibitors based upon the structures of their drug–enzyme complexes. Pharmacol Res. 2016;103:26–48.
Sun Y, Wong AK, Kamel MS. Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell. 2009;23(04):687–719.
Wang L, Chen L, Liu Z, Zheng M, Gu Q, Xu J. Predicting mTOR inhibitors with a classifier using recursive partitioning and Naïve Bayesian approaches. PloS ONE. 2014;9(5):e95221.
Yap CW. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011;32(7):1466–74.
Zakharov AV, Peach ML, Sitzmann M, Nicklaus MC. QSAR modeling of imbalanced high-throughput screening data in pubchem. J Chem Inf Model. 2014;54(3):705–12.
Zask A, Verheijen JC, Richard DJ. Recent advances in the discovery of small-molecule ATP competitive mTOR inhibitors: a patent review. Expert Opin Ther Patents. 2011;21(7):1109–27.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.
Rights and permissions
About this article
Cite this article
Kumari, C., Abulaish, M. & Subbarao, N. Using SMOTE to Deal with Class-Imbalance Problem in Bioactivity Data to Predict mTOR Inhibitors. SN COMPUT. SCI. 1, 150 (2020). https://doi.org/10.1007/s42979-020-00156-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s42979-020-00156-5