Journal of Computer-Aided Molecular Design, Volume 27, Issue 6, pp 551–567

eFindSite: Improved prediction of ligand binding sites in protein models using meta-threading, machine learning and auxiliary ligands

  • Michal Brylinski
  • Wei P. Feinstein


Molecular structures and functions of the majority of proteins across different species are yet to be identified. Much-needed functional annotation of these gene products often benefits from the knowledge of protein–ligand interactions. Towards this goal, we developed eFindSite, an improved version of FINDSITE, designed to more efficiently identify ligand binding sites and residues using only weakly homologous templates. It employs a collection of effective algorithms, including highly sensitive meta-threading approaches, improved clustering techniques, advanced machine learning methods and reliable confidence estimation systems. Depending on the quality of target protein structures, eFindSite outperforms geometric pocket detection algorithms by 15–40% in binding site detection and by 5–35% in binding residue prediction. Moreover, compared to FINDSITE, it identifies 14% more binding residues in the most difficult cases. When multiple putative binding pockets are identified, the ranking accuracy is 75–78%, which can be further improved by 3–4% by including auxiliary information on binding ligands extracted from biomedical literature. As a first across-genome application, we describe structure modeling and binding site prediction for the entire proteome of Escherichia coli. Carefully calibrated confidence estimates strongly indicate that highly reliable ligand binding predictions are made for the majority of gene products; thus, eFindSite holds significant promise for large-scale genome annotation and drug development projects. eFindSite is freely available to the academic community at


Keywords

Ligand binding site prediction; Binding residue prediction; Protein threading; Ligand virtual screening; Machine learning; Support vector machines



This study was supported by the Louisiana Board of Regents through the Board of Regents Support Fund [contract LEQSF(2012–15)-RD-A-05] and Oak Ridge Associated Universities (ORAU) through the 2012 Ralph E. Powe Junior Faculty Enhancement Award. Portions of this research were conducted with high performance computational resources provided by Louisiana State University (



Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  1. Department of Biological Sciences, Louisiana State University, Baton Rouge, USA
  2. Center for Computation and Technology, Louisiana State University, Baton Rouge, USA
