Machine Learning Methods in Chemoinformatics for Drug Discovery

  • Muthukumarasamy Karthikeyan
  • Renu Vyas


It is well known that the structure of a molecule determines its biological activity and physicochemical properties. Here, we describe the role of machine learning (ML) and statistical methods in building reliable, predictive models in chemoinformatics. ML methods are broadly divided into clustering, classification, and regression techniques. The statistical and mathematical techniques that form part of the ML toolkit, such as artificial neural networks, hidden Markov models, support vector machines, decision tree learning, Random Forest, Naive Bayes, and belief networks, are well suited to drug discovery and play an important role in the lead identification and lead optimization steps. This chapter provides stepwise procedures for building ML-based classification and regression models using state-of-the-art open-source and proprietary tools. A few case studies on benchmark data sets demonstrate the efficacy of ML-based classification for drug design.
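
As a rough illustration of the classification setting mentioned above, the following is a minimal sketch of a Gaussian Naive Bayes classifier in plain Python. It is not taken from the chapter: the descriptor values (molecular weight and logP), the activity labels, and the function names `fit_gaussian_nb` and `predict` are all hypothetical, chosen only to show the idea on toy data.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a log prior and per-descriptor mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    n_total = len(X)
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / n_total), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one compound."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for v, m, var in zip(row, means, variances):
            # Log of the Gaussian density for this descriptor.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: [molecular weight, logP] per compound.
X_train = [
    [320.0, 2.1], [350.0, 2.8], [310.0, 2.4],   # labelled "active"
    [505.0, 5.6], [540.0, 6.1], [520.0, 5.2],   # labelled "inactive"
]
y_train = ["active"] * 3 + ["inactive"] * 3

model = fit_gaussian_nb(X_train, y_train)
print(predict(model, [330.0, 2.5]))
print(predict(model, [530.0, 5.8]))
```

In practice, a model like this would be built on computed molecular descriptors and validated on held-out compounds; the sketch only shows the fit and predict steps.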


Machine learning, Neural networks, SVM, SVR, Genetic programming, Chemoinformatics, Drug design



Copyright information

© Springer India 2014

Authors and Affiliations

  1. Digital Information Resource Centre, National Chemical Laboratory, Pune, India
  2. Scientist (DST), Division of Chemical Engineering and Process Development, National Chemical Laboratory, Pune, India
