Feature Selection Investigation in Machine Learning Docking Scoring Functions

  • Conference paper
Advances in Bioinformatics and Computational Biology (BSB 2023)


The in silico evaluation of interactions between small molecules (ligands) and receptors (proteins) is of great importance, especially in drug design. It is one of the principal computational methodologies that can be incorporated into the process of proposing new drugs, with the aim of reducing the high financial cost and time involved. In this context, molecular docking is a computer simulation procedure used to predict the best conformation and orientation of a ligand in the binding site of a target protein. Docking algorithms evaluate protein-ligand interactions using scoring functions (SFs), which computationally quantify the binding affinity of the complex. SFs can be divided into four categories according to the methodology applied in their development: physics-based, empirical, knowledge-based and machine learning. Machine learning (ML) scoring functions are trained on features obtained from known protein-ligand complexes and their experimental affinities, so they rely heavily on the set of features used to train them. In this work, we therefore use PCA, ANOVA and Random Forest to investigate how these feature selection methods impact the performance of three ML scoring functions trained with Support Vector Machines, Elastic Net Regularization and Neural Networks. The results show that Neural Networks can benefit greatly from feature selection performed by Random Forest, but not from ANOVA or PCA. We conclude that feature selection can improve regression results and that, in this study, Neural Networks combined with Random Forest is the best option.
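To make the filter-style selection step concrete, the following is a minimal sketch of ANOVA-type feature ranking for a regression target such as binding affinity. It is not the authors' implementation: the abstract does not say which software or settings were used. The data are synthetic, and the names `f_scores` and `select_k_best`, the six descriptors and the choice `k=2` are all hypothetical. Each descriptor is scored with the univariate F-statistic of a simple linear regression against the target, F = (n - 2) r² / (1 - r²), and the top-scoring descriptors are kept.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def f_scores(X, y):
    """Univariate regression F-statistic of each column of X against y."""
    n = len(y)
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        r2 = pearson_r(column, y) ** 2
        r2 = min(r2, 1.0 - 1e-12)  # guard against division by zero
        scores.append(r2 / (1.0 - r2) * (n - 2))
    return scores


def select_k_best(X, y, k):
    """Return the sorted column indices of the k highest-scoring features."""
    scores = f_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Synthetic stand-in for a descriptor table (assumption, not PDBbind data):
# 200 "complexes", 6 descriptors, affinity driven by descriptors 0 and 3.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(200)]
y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0.0, 0.5) for row in X]

scores = f_scores(X, y)
selected = select_k_best(X, y, k=2)
print("F-scores:", [round(s, 1) for s in scores])
print("Selected descriptor indices:", selected)
```

A filter of this kind scores each descriptor independently of the downstream regressor, whereas Random Forest importances come from a fitted model and PCA builds new combined components rather than choosing among the original descriptors. The reduced feature matrix would then be passed to the regressor being evaluated.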

  1. In this paper, features and descriptors are treated as synonyms.



The authors acknowledge the financial support given by CAPES Financial Code 001, CNPq grants 439582/2018-0 and 440363/2022-5, and FAPERGS processes 22/2551-0000385-0 and 22/2551-0000390-7.

Corresponding author

Correspondence to Maurício Dorneles Caldeira Balboni.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Balboni, M.D.C., Arrua, O.E., Werhli, A.V., dos Santos Machado, K. (2023). Feature Selection Investigation in Machine Learning Docking Scoring Functions. In: Reis, M.S., de Melo-Minardi, R.C. (eds) Advances in Bioinformatics and Computational Biology. BSB 2023. Lecture Notes in Computer Science, vol 13954. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42714-5

  • Online ISBN: 978-3-031-42715-2

  • eBook Packages: Computer Science, Computer Science (R0)
