Improving the binding affinity estimations of protein–ligand complexes using machine-learning facilitated force field method


Scoring functions are routinely deployed in structure-based drug design to quantify the potential for protein–ligand (PL) complex formation. Here, we present a new scoring function, Bappl+, designed to predict the binding affinities of non-metallo and metallo PL complexes. Bappl+ outperforms other state-of-the-art scoring functions, achieving a Pearson correlation coefficient of up to ~0.76 with low standard deviations. The biggest contributors to the increased performance are the use of a machine-learning model and the enlarged training dataset. We have also evaluated the performance of Bappl+ on target-specific proteins, which highlighted the limitations of our function and pointed to avenues for further improvement. We believe that the Bappl+ methodology could prove valuable in ranking candidate molecules against a target metallo or non-metallo protein by reliably predicting their binding affinities, thus helping in the drug discovery process.
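For readers unfamiliar with how such figures are obtained, the following is a minimal sketch of the two statistics named above, computed between predicted and experimental binding affinities. It is not the Bappl+ implementation. The function names and the toy affinity values are hypothetical, and the standard-deviation definition used here (residuals of a one-variable linear fit, with an N − 1 denominator, one common convention in scoring-function benchmarks) is an assumption, since the abstract does not define it.

```python
import math


def _centered_sums(predicted, experimental):
    """Return means plus the centered sums of squares and cross-products."""
    n = len(predicted)
    if n != len(experimental) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    return n, mean_p, mean_e, cov, var_p, var_e


def pearson_r(predicted, experimental):
    """Pearson correlation coefficient between predicted and experimental values."""
    _, _, _, cov, var_p, var_e = _centered_sums(predicted, experimental)
    if var_p == 0 or var_e == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_e)


def regression_sd(predicted, experimental):
    """Standard deviation of residuals from the least-squares line
    experimental = a + b * predicted, using an N - 1 denominator
    (an assumed convention, not taken from the article)."""
    n, mean_p, mean_e, cov, var_p, _ = _centered_sums(predicted, experimental)
    if var_p == 0:
        raise ValueError("cannot fit a line to constant predictions")
    b = cov / var_p              # slope of the fitted line
    a = mean_e - b * mean_p      # intercept of the fitted line
    sse = sum((e - (a + b * p)) ** 2 for p, e in zip(predicted, experimental))
    return math.sqrt(sse / (n - 1))


if __name__ == "__main__":
    # Invented toy values (pKd-like units), for illustration only.
    predicted = [5.1, 6.3, 7.0, 8.2, 4.4]
    experimental = [4.8, 6.9, 6.5, 8.9, 5.0]
    print("Pearson r:", round(pearson_r(predicted, experimental), 3))
    print("SD of fit:", round(regression_sd(predicted, experimental), 3))
```

A higher correlation and a lower standard deviation on an independent test set both indicate closer agreement between predicted and measured affinities.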





The authors gratefully acknowledge support to SCFBio from the Department of Biotechnology, Govt. of India. The authors thank Dr. Prashant S. Rana for sharing his insights into the random forest method and Mr. Manpreet Singh for web-enabling Bappl+.

Author information




AS and BJ conceived the project. AS performed all the calculations. RB helped in fine-tuning the work and in generating the web server. All authors analyzed the data and wrote the manuscript.

Corresponding author

Correspondence to B. Jayaram.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


Electronic supplementary material

Supplementary file 1 (DOCX, 1143 kb)


Cite this article

Soni, A., Bhat, R. & Jayaram, B. Improving the binding affinity estimations of protein–ligand complexes using machine-learning facilitated force field method. J Comput Aided Mol Des 34, 817–830 (2020).

Keywords


  • Protein–ligand interactions
  • Binding affinity
  • Scoring functions
  • Random forest