eFindSite: Improved prediction of ligand binding sites in protein models using meta-threading, machine learning and auxiliary ligands

Journal of Computer-Aided Molecular Design

Abstract

Molecular structures and functions of the majority of proteins across different species are yet to be identified. Much-needed functional annotation of these gene products often benefits from knowledge of protein–ligand interactions. Towards this goal, we developed eFindSite, an improved version of FINDSITE, designed to more efficiently identify ligand binding sites and residues using only weakly homologous templates. It employs a collection of effective algorithms, including highly sensitive meta-threading approaches, improved clustering techniques, advanced machine learning methods and reliable confidence estimation systems. Depending on the quality of target protein structures, eFindSite outperforms geometric pocket detection algorithms by 15–40% in binding site detection and by 5–35% in binding residue prediction. Moreover, compared to FINDSITE, it identifies 14% more binding residues in the most difficult cases. When multiple putative binding pockets are identified, the ranking accuracy is 75–78%, which can be further improved by 3–4% by including auxiliary information on binding ligands extracted from biomedical literature. As a first genome-wide application, we describe structure modeling and binding site prediction for the entire proteome of Escherichia coli. Carefully calibrated confidence estimates strongly indicate that highly reliable ligand binding predictions are made for the majority of gene products; eFindSite thus holds significant promise for large-scale genome annotation and drug development projects. eFindSite is freely available to the academic community at http://www.brylinski.org/efindsite.

References

  1. Hoehndorf R, Kelso J, Herre H (2009) The ontology of biological sequences. BMC Bioinformatics 10:377

  2. Stevens R, Goble CA, Bechhofer S (2000) Ontology-based knowledge representation for bioinformatics. Brief Bioinformatics 1(4):398–414

  3. Ashburner M et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29

  4. Harris MA et al (2004) The gene ontology (GO) database and informatics resource. Nucleic Acids Res 32(Database issue):D258–D261

  5. Lybrand TP (2002) Protein-ligand interactions. In: Naray-Szabo G, Warshel A (eds) Computational approaches to biochemical reactivity. Springer, Boston, pp 363–374

  6. Metzker ML (2010) Sequencing technologies—the next generation. Nat Rev Genet 11(1):31–46

  7. Zhang J et al (2011) The impact of next-generation sequencing on genomics. J Genet Genomics 38(3):95–109

  8. Juncker AS et al (2009) Sequence-based feature prediction and annotation of proteins. Genome Biol 10(2):206

  9. Loewenstein Y et al (2009) Protein function annotation by homology-based inference. Genome Biol 10(2):207

  10. Ahmad S, Sarai A (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics 6:33

  11. Hwang S, Gou Z, Kuznetsov IB (2007) DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 23(5):634–636

  12. Chen P, Li J (2010) Sequence-based identification of interface residues by an integrative profile combining hydrophobic and evolutionary information. BMC Bioinformatics 11:402

  13. Chen XW, Jeong JC (2009) Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 25(5):585–591

  14. Soding J (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21(7):951–960

  15. Lopez G et al (2011) Firestar—advances in the prediction of functionally important residues. Nucleic Acids Res 39(Web Server issue):W235–W241

  16. Lord PW et al (2003) Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19(10):1275–1283

  17. Schnoes AM et al (2009) Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 5(12):e1000605

  18. Zhang QC et al (2011) PredUs: a web server for predicting protein interfaces using structural neighbors. Nucleic Acids Res 39(Web Server issue):W283–W287

  19. Brylinski M et al (2007) Prediction of functional sites based on the fuzzy oil drop model. PLoS Comput Biol 3(5):e94

  20. Brylinski M et al (2007) Localization of ligand binding site in proteins identified in silico. J Mol Model 13(6–7):665–675

  21. Dudev M, Lim C (2007) Discovering structural motifs using a structural alphabet: application to magnesium-binding sites. BMC Bioinformatics 8:106

  22. Laskowski RA (1995) SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph 13(5):323–330, 307–308

  23. Liang J, Edelsbrunner H, Woodward C (1998) Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci 7(9):1884–1897

  24. Levitt DG, Banaszak LJ (1992) POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. J Mol Graph 10(4):229–234

  25. Huang B, Schroeder M (2006) LIGSITEcsc: predicting ligand binding sites using the connolly surface and degree of conservation. BMC Struct Biol 6:19

  26. Le Guilloux V, Schmidtke P, Tuffery P (2009) Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics 10:168

  27. Zhu H, Pisabarro MT (2011) MSPocket: an orientation-independent algorithm for the detection of ligand binding pockets. Bioinformatics 27(3):351–358

  28. Huang B (2009) MetaPocket: a meta approach to improve protein ligand binding site prediction. OMICS 13(4):325–330

  29. Skolnick J, Brylinski M (2009) FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinformatics 10(4):378–391

  30. Wass MN, Kelley LA, Sternberg MJ (2010) 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res 38(Web Server issue):W469–W473

  31. Brylinski M, Skolnick J (2008) A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc Natl Acad Sci USA 105(1):129–134

  32. Roche DB, Tetchner SJ, McGuffin LJ (2011) FunFOLD: an improved automated method for the prediction of ligand binding residues using 3D models of proteins. BMC Bioinformatics 12:160

  33. Brylinski M, Skolnick J (2011) FINDSITE-metal: integrating evolutionary information and machine learning for structure-based metal-binding site prediction at the proteome level. Proteins 79(3):735–751

  34. Dror I et al (2011) Predicting nucleic acid binding interfaces from structural models of proteins. Proteins

  35. Mukherjee S, Zhang Y (2011) Protein-protein complex structure predictions by multimeric threading and template recombination. Structure 19(7):955–966

  36. Tyagi M et al (2012) Homology inference of protein–protein interactions via conserved binding sites. PLoS ONE 7(1):e28896

  37. Pandit SB, Skolnick J (2008) Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score. BMC Bioinformatics 9:531

  38. Ortiz AR, Strauss CE, Olmea O (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11(11):2606–2621

  39. Russell RB, Sasieni PD, Sternberg MJ (1998) Supersites within superfolds. Binding site similarity in the absence of homology. J Mol Biol 282(4):903–918

  40. Brylinski M, Skolnick J (2010) Comparison of structure-based and threading-based approaches to protein functional annotation. Proteins 78(1):118–134

  41. Laurie AT, Jackson RM (2006) Methods for the prediction of protein-ligand binding sites for structure-based drug design and virtual ligand screening. Curr Protein Pept Sci 7(5):395–406

  42. Li YY, An J, Jones SJ (2006) A large-scale computational approach to drug repositioning. Genome Inform 17(2):239–247

  43. Li YY, An J, Jones SJ (2011) A computational approach to finding novel targets for existing drugs. PLoS Comput Biol 7(9):e1002139

  44. Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972–976

  45. Brylinski M, Lingam D (2012) eThread: a highly optimized machine learning-based approach to meta-threading and the modeling of protein tertiary structures. PLoS ONE 7(11):e50200

  46. Brylinski M, Feinstein WP (2012) Setting up a meta-threading pipeline for high-throughput structural bioinformatics: eThread software distribution, walkthrough and resource profiling. J Comput Sci Syst Biol 6(1):001–010

  47. Wallach I, Lilien R (2009) The protein-small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics 25(5):615–620

  48. Wang G, Dunbrack RL Jr (2003) PISCES: a protein sequence culling server. Bioinformatics 19(12):1589–1591

  49. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57(4):702–710

  50. Berman HM et al (2000) The protein data bank. Nucleic Acids Res 28(1):235–242

  51. Bindewald E, Skolnick J (2005) A scoring function for docking ligands to low-resolution protein structures. J Comput Chem 26(4):374–383

  52. Biegert A, Soding J (2009) Sequence context-specific profiles for homology searching. Proc Natl Acad Sci USA 106(10):3770–3775

  53. Sadreyev R, Grishin N (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 326(1):317–336

  54. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14(9):755–763

  55. Bucher P et al (1996) A flexible motif search technique based on generalized profiles. Comput Chem 20(1):3–23

  56. Lobley A, Sadowski MI, Jones DT (2009) pGenTHREADER and pDomTHREADER: new methods for improved protein fold recognition and superfamily discrimination. Bioinformatics 25(14):1761–1767

  57. Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 12(2):95–107

  58. Zhou H, Zhou Y (2005) SPARKS 2 and SP3 servers in CASP6. Proteins 61(Suppl 7):152–156

  59. Jones DT, Taylor WR, Thornton JM (1992) A new approach to protein fold recognition. Nature 358(6381):86–89

  60. Tanimoto TT (1958) An elementary mathematical theory of classification and prediction. IBM Internal Report

  61. Guha R et al (2006) The blue obelisk-interoperability in chemical informatics. J Chem Inf Model 46(3):991–998

  62. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd ed. Morgan Kaufmann Publishers, San Francisco

  63. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453

  64. Roy A, Kucukural A, Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 5(4):725–738

  65. Soga S et al (2007) Use of amino acid composition to predict ligand-binding sites. J Chem Inf Model 47(2):400–406

  66. Marti-Renom MA et al (2007) The AnnoLite and AnnoLyze programs for comparative annotation of protein structures. BMC Bioinformatics 8(Suppl 4):S4

  67. Liu T, Altman RB (2009) Prediction of calcium-binding sites by combining loop-modeling with machine learning. BMC Struct Biol 9:72

  68. Kawabata T (2010) Detection of multiscale pockets on protein surfaces using mathematical morphology. Proteins 78(5):1195–1211

  69. Zhang Z et al (2011) Identification of cavities on protein surface using multiple computational approaches for drug binding site prediction. Bioinformatics 27(15):2083–2088

  70. Blattner FR et al (1997) The complete genome sequence of Escherichia coli K-12. Science 277(5331):1453–1462

  71. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234(3):779–815

  72. Pandit SB, Zhang Y, Skolnick J (2006) TASSER-Lite: an automated tool for protein comparative modeling. Biophys J 91(11):4180–4190

  73. Brylinski M, Skolnick J (2007) What is the relationship between the global structures of apo and holo proteins? Proteins 70(2):363–377

  74. Chen X, Liu M, Gilson MK (2001) BindingDB: a web-accessible molecular recognition database. Comb Chem High Throughput Screen 4(8):719–725

  75. Wang Y et al (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37(Web Server issue):W623–W633

  76. Wishart DS et al (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Database issue):D668–D672

  77. Jacquet E, Parmeggiani A (1988) Structure-function relationships in the GTP binding domain of EF-Tu: mutation of Val20, the residue homologous to position 12 in p21. EMBO J 7(9):2861–2867

  78. Weijland A et al (1993) Asparagine-135 of elongation factor Tu is a crucial residue for the folding of the guanine nucleotide binding pocket. FEBS Lett 330(3):334–338

  79. Gumusel F et al (1990) Mutagenesis of the NH2-terminal domain of elongation factor Tu. Biochim Biophys Acta 1050(1–3):215–221

  80. Stebbins JW et al (1992) Arginine 54 in the active site of Escherichia coli aspartate transcarbamoylase is critical for catalysis: a site-specific mutagenesis, NMR, and X-ray crystallographic study. Protein Sci 1(11):1435–1446

  81. Waldrop GL et al (1992) The contribution of threonine 55 to catalysis in aspartate transcarbamoylase. Biochemistry 31(28):6592–6597

  82. Jin L, Stec B, Kantrowitz ER (2000) A cis-proline to alanine mutant of E. coli aspartate transcarbamoylase: kinetic studies and three-dimensional crystal structures. Biochemistry 39(27):8058–8066

  83. Kitano H (2002) Systems biology: a brief overview. Science 295(5560):1662–1664

  84. Xue L et al (2003) Design and evaluation of a molecular fingerprint involving the transformation of property descriptor values into a binary classification scheme. J Chem Inf Comput Sci 43(4):1151–1157

  85. Willett P (1998) Chemical similarity searching. J Chem Inf Model 38:983–996

Acknowledgments

This study was supported by the Louisiana Board of Regents through the Board of Regents Support Fund [contract LEQSF(2012–15)-RD-A-05] and Oak Ridge Associated Universities (ORAU) through the 2012 Ralph E. Powe Junior Faculty Enhancement Award. Portions of this research were conducted with high performance computational resources provided by Louisiana State University (http://www.hpc.lsu.edu).

Author information

Corresponding author

Correspondence to Michal Brylinski.

Appendix

Molecular fingerprints are bit strings that encode the structural and chemical features of organic compounds (see the Daylight manual for details: http://www.daylight.com/dayhtml/doc/theory/index.pdf). The Tanimoto coefficient is the most popular measure of the similarity between two sets of bits (e.g. molecular fingerprints). The classical Tanimoto coefficient (TC) [60] is defined as:

$$ TC = \frac{c}{a + b + c} $$
(2)

where a is the number of bits set on in the first string but not in the second, b is the number of bits set on in the second string but not in the first, and c is the number of bits set on in both strings.
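As a minimal sketch of Eq. 2, the counts a, b and c can be tallied directly from two equal-length bit strings. The function name `tanimoto`, the '0'/'1' string encoding, and the choice to return 0.0 when no bit is on in either string are assumptions made for illustration; they are not taken from the eFindSite code.

```python
def tanimoto(fp1: str, fp2: str) -> float:
    """Classical Tanimoto coefficient, TC = c / (a + b + c), for '0'/'1' strings."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    pairs = list(zip(fp1, fp2))
    a = sum(1 for x, y in pairs if x == "1" and y == "0")  # on in 1st only
    b = sum(1 for x, y in pairs if x == "0" and y == "1")  # on in 2nd only
    c = sum(1 for x, y in pairs if x == "1" and y == "1")  # on in both
    total = a + b + c
    # Assumption: TC is undefined when no bit is on in either string; use 0.0.
    return c / total if total else 0.0


print(tanimoto("1100", "1010"))
```

For the strings 1100 and 1010, a = b = c = 1, so the sketch gives TC = 1/3.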

In addition to the classical Tanimoto coefficient, the overlap between two molecular fingerprints can be measured by the average Tanimoto coefficient (aveTC) [84]:

$$ aveTC = \frac{TC + TC'}{2} $$
(3)

where TC′ is the Tanimoto coefficient calculated for bit positions set off rather than set on.
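Eq. 3 can be sketched in the same spirit: TC′ is obtained by applying the Tanimoto formula to the off bits instead of the on bits, and the two values are averaged. The helper `_tc_for_bit`, the function `ave_tanimoto` and the '0'/'1' string encoding are hypothetical names and conventions chosen for this example only, as is returning 0.0 for an undefined ratio.

```python
def _tc_for_bit(fp1: str, fp2: str, bit: str) -> float:
    """Tanimoto coefficient counting positions whose value equals `bit`."""
    only1 = sum(1 for x, y in zip(fp1, fp2) if x == bit and y != bit)
    only2 = sum(1 for x, y in zip(fp1, fp2) if x != bit and y == bit)
    both = sum(1 for x, y in zip(fp1, fp2) if x == bit and y == bit)
    total = only1 + only2 + both
    # Assumption: an undefined ratio (empty denominator) is reported as 0.0.
    return both / total if total else 0.0


def ave_tanimoto(fp1: str, fp2: str) -> float:
    """Average Tanimoto coefficient, aveTC = (TC + TC') / 2."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    tc_on = _tc_for_bit(fp1, fp2, "1")   # TC: bits set on
    tc_off = _tc_for_bit(fp1, fp2, "0")  # TC': bits set off
    return (tc_on + tc_off) / 2.0


print(ave_tanimoto("1110", "1100"))
```

For 1110 and 1100, TC = 2/3 and TC′ = 1/2, so aveTC = 7/12. Unlike TC alone, aveTC also rewards agreement on the bits that are off.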

Furthermore, a version of the Tanimoto coefficient for continuous variables (conTC) [85] was developed:

$$ conTC = \frac{\sum_{i} x_{pi} x_{ci}}{\sum_{i} x_{pi}^{2} + \sum_{i} x_{ci}^{2} - \sum_{i} x_{pi} x_{ci}} $$
(4)

where $x_{pi}$ is the i-th descriptor of a fingerprint profile and $x_{ci}$ is the i-th descriptor of a query compound. The fingerprint profile is constructed from individual fingerprints for a set of compounds, e.g. template-bound ligands that were used to identify a putative binding site in the target structure.
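A minimal sketch of Eq. 4 for two real-valued descriptor vectors follows. The function name `con_tanimoto` and the example values are illustrative assumptions; how eFindSite actually builds the fingerprint profile is not specified here, so the profile is simply passed in as a list of numbers.

```python
from typing import Sequence


def con_tanimoto(profile: Sequence[float], query: Sequence[float]) -> float:
    """Continuous Tanimoto coefficient between a profile (x_p) and a query (x_c)."""
    if len(profile) != len(query):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(profile, query))  # sum of x_pi * x_ci
    sq_p = sum(p * p for p in profile)                # sum of x_pi^2
    sq_c = sum(q * q for q in query)                  # sum of x_ci^2
    denom = sq_p + sq_c - dot
    # Assumption: two all-zero vectors give an undefined ratio; report 0.0.
    return dot / denom if denom else 0.0


# Made-up profile values and a binary query fingerprint, for illustration only.
print(con_tanimoto([0.5, 1.0, 0.0], [1.0, 1.0, 0.0]))
```

When both vectors contain only 0 and 1, the sums reduce to bit counts and conTC coincides with the classical TC of Eq. 2.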

About this article

Cite this article

Brylinski, M., Feinstein, W.P. eFindSite: Improved prediction of ligand binding sites in protein models using meta-threading, machine learning and auxiliary ligands. J Comput Aided Mol Des 27, 551–567 (2013). https://doi.org/10.1007/s10822-013-9663-5

  • DOI: https://doi.org/10.1007/s10822-013-9663-5
