Undersampling: case studies of flaviviral inhibitory activities
Imbalanced datasets, in which inactive compounds outnumber active ones, are a common challenge in ligand-based model building workflows for drug discovery. This is particularly true for neglected tropical diseases, since efforts to identify therapeutics for these diseases are often limited. In this report, we analyze the performance of several undersampling strategies in modeling Dengue virus 2 (DENV2) inhibitory activity, as well as anti-flaviviral activity against the West Nile (WNV) and Zika (ZIKV) viruses. To this end, we build datasets comprising 1218 (159 actives and 1059 inactives), 1044 (132 actives and 912 inactives) and 302 (75 actives and 227 inactives) molecules with known DENV2, WNV and ZIKV inhibitory activity profiles, respectively. We develop ensemble classifiers for these endpoints and compare the performance of the different undersampling algorithms on external sets. Data pruning algorithms yield superior performance relative to data selection algorithms. The best overall performance is provided by the one-sided selection algorithm, with test set balanced accuracy (BACC) values of 0.84, 0.74 and 0.77 for the DENV2, WNV and ZIKV inhibitory activities, respectively. For model building, we use the recently proposed GT-STAF information indices and compare the predictivity of three molecular fragmentation approaches: connected subgraphs, substructure and AlogP atom types, which show comparable performance. However, combining indices based on these fragmentation strategies enhances the predictivity of the built ensembles. The built models could be useful for screening new molecules with possible DENV, WNV and ZIKV inhibitory activities. ADMET modelers are encouraged to adopt undersampling algorithms in their workflows when dealing with imbalanced datasets.
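As a rough illustration of two ideas named in the abstract, the sketch below computes balanced accuracy (the mean of sensitivity and specificity) and prunes majority-class points that form Tomek links, which is one of the steps used in one-sided selection. It is not the authors' code: the function names, the Euclidean distance, and the toy one-dimensional data are assumptions made only for this example.

```python
from math import dist


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def nearest_neighbor(i, X):
    """Index of the point closest to X[i] (Euclidean distance, assumed here)."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: dist(X[i], X[j]))


def remove_tomek_majority(X, y, majority=0):
    """Drop majority-class points that take part in a Tomek link.

    A Tomek link is a pair of mutual nearest neighbors with opposite labels.
    """
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:
            drop.add(i if y[i] == majority else j)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): two actives (1) and four inactives (0).
X = [(0.0,), (1.0,), (1.2,), (3.0,), (3.5,), (5.0,)]
y = [1, 1, 0, 0, 0, 0]

# The inactive at 1.2 and the active at 1.0 are mutual nearest neighbors,
# so the inactive point is pruned; the other inactives are kept.
X_res, y_res = remove_tomek_majority(X, y)

# Balanced accuracy for a hypothetical set of predictions on the toy labels.
bacc = balanced_accuracy(y, [1, 0, 0, 0, 0, 1])
```

Because BACC averages the per-class recall, a classifier that labels every compound inactive scores 0.5 regardless of how skewed the class ratio is, which is why it suits datasets such as the 159 active versus 1059 inactive DENV2 set.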
Keywords: Dengue virus, West Nile virus, Zika virus, Undersampling, Support vector machine, Information index
- WNV: West Nile virus
- GT-STAF IFI: Graph Theoretical Thermodynamic STAte Functions Information Index
- QSAR: Quantitative structure–activity relationships
The authors thank the reviewers for their valuable comments and for taking the time to review the submitted Python scripts.