Quantitative Biology

Volume 6, Issue 4, pp 334–343

Analysis of protein features and machine learning algorithms for prediction of druggable proteins

  • Tanlin Sun
  • Luhua Lai
  • Jianfeng Pei
Research Article



Computational tools are widely used in the drug discovery process because they reduce time and cost. Predicting whether a protein is druggable is a fundamental and crucial step in the drug research pipeline. Sequence-based protein function prediction plays a vital role in many research areas. Training data, protein feature selection and machine learning algorithms are three indispensable elements that determine the success of such models.


In this study, we tested the performance of different combinations of protein features and machine learning algorithms for druggable protein prediction, based on the targets of FDA-approved small molecules. We also enlarged the dataset to include the targets of small molecules that are in experimental or clinical investigation.


We found that although the 146-d vector used by Li et al. combined with a neural network achieved the best training accuracy of 91.10%, overlapped 3-gram word2vec with logistic regression achieved the best prediction accuracy on the independent test set (89.55%) and on newly approved targets. We also trained models on the enlarged dataset that included targets of small molecules in experimental and clinical investigation; unfortunately, the best training accuracy was only 75.48%. In addition, we applied our models to predict potential targets as a reference for future studies.
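As a rough illustration of the "overlapped 3-gram" representation mentioned above, the sketch below splits a protein sequence into overlapping 3-residue words and pools per-word vectors into one fixed-length feature vector. The function names, the toy embedding table and the mean pooling step are assumptions made for illustration; the abstract does not state how the word vectors were combined. In practice the table would come from a trained word2vec model, and the pooled vector would be passed to a classifier such as logistic regression.

```python
from typing import Dict, List, Sequence


def overlapping_kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers using a stride of 1."""
    seq = sequence.strip().upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mean_embedding(
    words: Sequence[str],
    table: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the vectors of the words found in `table`.

    Words missing from the table are skipped; if no word is known,
    a zero vector of length `dim` is returned.
    Mean pooling is an assumption, not the paper's stated method.
    """
    total = [0.0] * dim
    known = 0
    for word in words:
        vec = table.get(word)
        if vec is None:
            continue
        known += 1
        for j, value in enumerate(vec):
            total[j] += value
    if known == 0:
        return total
    return [t / known for t in total]


if __name__ == "__main__":
    # Hypothetical 2-d embeddings; real word2vec vectors are much longer.
    toy_table = {"MKT": [1.0, 3.0], "KTA": [3.0, 5.0]}
    words = overlapping_kmers("MKTAYI")
    print(words)
    print(mean_embedding(words, toy_table, dim=2))
```

Because the words overlap, a sequence of length n produces n - 2 three-residue words, so every residue except those at the ends contributes to three words.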


Our study indicates the potential of word2vec for the prediction of druggable proteins. It also suggests that the training dataset of druggable proteins should not be extended to targets that lack verification. The target prediction package can be found on


druggable protein; drug target; word2vec; deep learning



This work was supported in part by the Ministry of Science and Technology of China (No. 2016YFA0502303) and the National Natural Science Foundation of China (Nos. 21673010 and 81273436). The authors would like to thank Youjun Xu, Shuaishi Gao and Qiwan Hu for discussions and advice.


  1. The UniProt Consortium. (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169
  2. Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J., Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z., et al. (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res., 46, D1074–D1082
  3. Butcher, S. P. (2003) Target discovery and validation in the postgenomic era. Neurochem. Res., 28, 367–371
  4. Dundas, J., Ouyang, Z., Tseng, J., Binkowski, A., Turpaz, Y. and Liang, J. (2006) CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucleic Acids Res., 34, W116–W118
  5. Schmidtke, P., Le Guilloux, V., Maupetit, J. and Tufféry, P. (2010) fpocket: online tools for protein ensemble pocket detection and tracking. Nucleic Acids Res., 38, W582–W589
  6. Hussein, H. A., Borrel, A., Geneix, C., Petitjean, M., Regad, L. and Camproux, A.-C. (2015) PockDrug-Server: a new web server for predicting pocket druggability on holo and apo proteins. Nucleic Acids Res., 43, W436–W442
  7. Yuan, Y., Pei, J. and Lai, L. (2013) Binding site detection and druggability prediction of protein targets for structure-based drug design. Curr. Pharm. Des., 19, 2326–2333
  8. Hajduk, P. J., Huth, J. R. and Fesik, S. W. (2005) Druggability indices for protein targets derived from NMR-based screening data. J. Med. Chem., 48, 2518–2525
  9. Rose, P. W., Prlic, A., Altunkaya, A., Bi, C., Bradley, A. R., Christie, C. H., Costanzo, L. D., Duarte, J. M., Dutta, S. and Feng, Z. (2016) The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res., 45, D271–D281
  10. Mitsopoulos, C., Schierz, A. C., Workman, P. and Al-Lazikani, B. (2015) Distinctive behaviors of druggable proteins in cellular networks. PLoS Comput. Biol., 11, e1004597
  11. Lipinski, C. A. (2004) Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov. Today Technol., 1, 337–341
  12. Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. and Hopkins, A. L. (2012) Quantifying the chemical beauty of drugs. Nat. Chem., 4, 90–98
  13. Li, Q. and Lai, L. (2007) Prediction of potential drug targets based on simple sequence properties. BMC Bioinformatics, 8, 353
  14. Jamali, A. A., Ferdousi, R., Razzaghi, S., Li, J., Safdari, R. and Ebrahimie, E. (2016) DrugMiner: comparative analysis of machine learning algorithms for prediction of potential druggable proteins. Drug Discov. Today, 21, 718–724
  15. Guo, Y., Yu, L., Wen, Z. and Li, M. (2008) Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res., 36, 3025–3030
  16. Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., Li, Y. and Jiang, H. (2007) Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA, 104, 4337–4341
  17. Asgari, E. and Mofrad, M. R. (2015) Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One, 10, e0141287
  18. Wallach, H. M. (2006) Topic modeling: beyond bag-of-words. In ICML ’06 Proceedings of the 23rd International Conference on Machine learning. pp. 977–984, Pittsburgh
  19. Xue, B., Fu, C. and Shaobin, Z. (2014) A study on sentiment computing and classification of sina weibo with word2vec. In 2014 IEEE International Congress on Big Data. pp. 358–363. Anchorage
  20. Chung, Y.-A., Wu, C.-C., Shen, C.-H., Lee, H.-Y. and Lee, L.-S. (2016) Audio word2vec: unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv, 1603.00982
  21. Ngo, D. L., Yamamoto, N., Tran, V. A., Nguyen, N. G., Phan, D., Lumbanraja, F. R., Kubo, M. and Satou, K. (2016) Application of word embedding to drug repositioning. J. Biomed. Sci. Eng., 9, 7–16
  22. Kimothi, D., Soni, A., Biyani, P. and Hogan, J. M. (2016) Distributed Representations for Biological Sequence Analysis. arXiv, 1608.05949
  23. Vang, Y. S. and Xie, X. (2017) HLA class I binding prediction via convolutional neural networks. Bioinformatics, 33, 2658–2665
  24. Kanehisa, M. and Goto, S. (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28, 27–30
  25. Zeng, Y. H., Guo, Y. Z., Xiao, R. Q., Yang, L., Yu, L. Z. and Li, M. L. (2009) Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. J. Theor. Biol., 259, 366–372
  26. Liu, T., Geng, X., Zheng, X., Li, R. and Wang, J. (2012) Accurate prediction of protein structural class using auto covariance transformation of PSI-BLAST profiles. Amino Acids, 42, 2243–2249
  27. Wang, Y.-C., Wang, X.-B., Yang, Z.-X. and Deng, N.-Y. (2010) Prediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature. Protein Pept. Lett., 17, 1441–1449
  28. Ottis, P., Toure, M., Cromm, P. M., Ko, E., Gustafson, J. L. and Crews, C. M. (2017) Assessing different E3 ligases for small molecule induced protein ubiquitination and degradation. ACS Chem. Biol., 12, 2570–2578

Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  2. Beijing National Laboratory for Molecular Science, State Key Laboratory for Structural Chemistry of Unstable and Stable Species, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
  3. Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, China
