Using SMOTE to Deal with Class-Imbalance Problem in Bioactivity Data to Predict mTOR Inhibitors

  • Original Research
  • SN Computer Science

Abstract

Machine learning algorithms perform sub-optimally on class-imbalanced datasets. Mammalian target of rapamycin (mTOR) is a serine/threonine protein kinase that plays an integral role in the autophagy pathway. Autophagy is a cellular pathway for recycling macromolecules (proteins, lipids, and organelles), which enables eukaryotic cells to adapt their metabolism and survive adverse growth conditions. Its role in this pathway makes mTOR a promising pharmacological target for autophagy modulation in cancer. The bioactivity dataset for mTOR in ChEMBL, a compound bioactivity database maintained by the European Bioinformatics Institute, shows a disproportionate distribution of active and inactive classes. Predictive models built on this skewed dataset are biased towards predicting the majority class. Hence, we have used the Synthetic Minority Over-sampling TEchnique (SMOTE) to deal with the class-imbalance problem in bioactivity datasets. We have built and evaluated predictive models based on four commonly used classifiers using both class-imbalanced and class-balanced bioactivity datasets, and compared their performance on several metrics: accuracy, sensitivity, specificity, F1-measure, and AUC. We observe that classification models trained on the balanced dataset generally outperform those trained on the class-imbalanced dataset, irrespective of the classifier used. We conclude that predictive models trained on a class-balanced dataset can be used to screen large compound bioactivity datasets for mTOR inhibitor-like compounds.
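To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation together with four of the metrics named in the abstract. It is not the authors' implementation, and the tooling they used is not shown on this page. The function names `smote` and `binary_metrics`, the toy descriptor vectors, and the choice of `k` are all assumptions made for illustration. The sketch also simplifies the original algorithm by drawing base samples at random instead of looping over every minority sample, and it omits AUC because that requires ranked prediction scores.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours.

    `minority` is a list of equal-length numeric tuples (hypothetical
    descriptor vectors for the minority class).
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base sample, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for labels 1 (active) / 0 (inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Toy data only: 4 minority-class and 12 majority-class vectors.
    minority_class = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    majority_class = [(5.0 + i, 5.0) for i in range(12)]
    extra = smote(minority_class, len(majority_class) - len(minority_class), k=3)
    print(len(minority_class) + len(extra), len(majority_class))
```

Because each synthetic point lies on a line segment between two existing minority samples, the oversampled class stays inside the region the minority already occupies. In a real evaluation, synthetic samples would normally be generated from the training split only, so that the test data remain untouched.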

Author information

Corresponding author

Correspondence to Muhammad Abulaish.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

About this article

Cite this article

Kumari, C., Abulaish, M. & Subbarao, N. Using SMOTE to Deal with Class-Imbalance Problem in Bioactivity Data to Predict mTOR Inhibitors. SN COMPUT. SCI. 1, 150 (2020). https://doi.org/10.1007/s42979-020-00156-5

  • DOI: https://doi.org/10.1007/s42979-020-00156-5