Journal of Computer-Aided Molecular Design, Volume 33, Issue 11, pp 997–1008

Undersampling: case studies of flaviviral inhibitory activities

  • Stephen J. Barigye
  • José Manuel García de la Vega
  • Juan A. Castillo-Garit


Imbalanced datasets, in which inactive compounds outnumber active ones, are a common challenge in ligand-based model building workflows for drug discovery. This is particularly true for neglected tropical diseases, since efforts to identify therapeutics for these diseases are often limited. In this report, we analyze the performance of several undersampling strategies in modeling Dengue virus 2 (DENV2) inhibitory activity, as well as anti-flaviviral activity against the West Nile (WNV) and Zika (ZIKV) viruses. To this end, we build datasets comprising 1218 (159 actives and 1059 inactives), 1044 (132 actives and 912 inactives) and 302 (75 actives and 227 inactives) molecules with known DENV2, WNV and ZIKV inhibitory activity profiles, respectively. We develop ensemble classifiers for these endpoints and compare the performance of the different undersampling algorithms on external sets. Data pruning algorithms yield superior performance relative to data selection algorithms. The best overall performance is provided by the one-sided selection algorithm, with test set balanced accuracy (BACC) values of 0.84, 0.74 and 0.77 for the DENV2, WNV and ZIKV inhibitory activities, respectively. For model building, we use the recently proposed GT-STAF information indices and compare the predictivity of three molecular fragmentation approaches: connected subgraphs, substructure and AlogP atom types, which show comparable performance. However, combining indices based on these fragmentation strategies enhances the predictivity of the built ensembles. The built models could be useful for screening new molecules with possible DENV, WNV and ZIKV inhibitory activities. ADMET modelers are encouraged to adopt undersampling algorithms in their workflows when dealing with imbalanced datasets.
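To make the two central ideas of the abstract concrete, the following is a minimal Python sketch, not the authors' in-house script. It shows (a) why balanced accuracy rather than plain accuracy is the relevant metric at the DENV2 class ratio reported above, and (b) Tomek-link removal, a data pruning step that one-sided selection combines with a condensed nearest-neighbour step. The function names, the Euclidean distance and the one-dimensional toy descriptors are assumptions made purely for illustration; the full DENV2 class counts stand in for a test set only to show the effect of imbalance.

```python
from math import dist  # Euclidean distance, Python 3.8+


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: dist(X[i], X[j]))


def remove_tomek_links(X, y, majority=0):
    """Drop majority-class points that form a Tomek link.

    Two points of different classes form a Tomek link when each is the
    other's nearest neighbour. Only the majority member is removed.
    """
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    drop = {i for i in range(len(X))
            if y[i] == majority and y[nn[i]] != majority and nn[nn[i]] == i}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# (a) A trivial "always inactive" classifier at the DENV2 class ratio
# quoted in the abstract (159 actives, 1059 inactives).
y_true = [1] * 159 + [0] * 1059
y_pred = [0] * len(y_true)
acc_trivial = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bacc_trivial = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc_trivial:.3f}  balanced accuracy={bacc_trivial:.3f}")

# (b) Hypothetical 1-D descriptors: the inactive at 0.9 and the active
# at 1.0 are mutual nearest neighbours, so the inactive is pruned.
X = [(0.0,), (0.3,), (0.5,), (0.9,), (1.0,), (1.5,)]
y = [0, 0, 0, 0, 1, 1]
X_pruned, y_pruned = remove_tomek_links(X, y)
print(X_pruned, y_pruned)
```

The trivial classifier scores a high plain accuracy while its balanced accuracy stays at chance level, which is why the BACC values quoted above are the more informative figure. In practice a maintained implementation of one-sided selection would normally be preferred over a hand-written pruning loop such as this one.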



Keywords

Dengue virus; West Nile virus; Zika virus; Undersampling; Support vector machine; Information index



Abbreviations

DENV: Dengue virus
WNV: West Nile virus
ZIKV: Zika virus
GT-STAF information index: Graph Theoretical Thermodynamic STAte Functions Information Index
QSAR: Quantitative structure–activity relationships



The authors thank the reviewers for their valuable comments and for taking the time to review the submitted Python scripts.

Supplementary material

10822_2019_255_MOESM1_ESM.xlsx (2.9 MB)
Supplementary material 1: Matrix of variables employed in the ensemble model building and the corresponding descriptions of the adopted nomenclature, the chemical compounds comprising the imbalanced datasets, and the in-house Python script employed in the present study. (XLSX 2964 kb)
Supplementary material 2 (ZIP 1350 kb)
Supplementary material 3 (ZIP 849 kb)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Departamento de Química Física Aplicada, Facultad de Ciencias, Universidad Autónoma de Madrid (UAM), Madrid, Spain
  2. Unidad de Toxicología Experimental, Universidad de Ciencias Médicas de Villa Clara, Santa Clara, Cuba
