Abstract
It is well known that the structure of a molecule determines its biological activity and physicochemical properties. Here, we describe the role of machine learning (ML) and statistical methods in building reliable, predictive models in chemoinformatics. ML methods are broadly divided into clustering, classification and regression techniques. Among the statistical and mathematical techniques that make up the ML toolkit, artificial neural networks, hidden Markov models, support vector machines, decision tree learning, random forests, naive Bayes classifiers and belief networks are particularly well suited to drug discovery and play an important role in the lead identification and lead optimization steps. This chapter provides stepwise procedures for building ML-based classification and regression models using state-of-the-art open-source and proprietary tools. A few case studies on benchmark data sets are presented to demonstrate the efficacy of ML-based classification for drug design.
Copyright information
© 2014 Springer India
About this chapter
Cite this chapter
Karthikeyan, M., Vyas, R. (2014). Machine Learning Methods in Chemoinformatics for Drug Discovery. In: Practical Chemoinformatics. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1780-0_3
DOI: https://doi.org/10.1007/978-81-322-1780-0_3
Published:
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-1779-4
Online ISBN: 978-81-322-1780-0
eBook Packages: Chemistry and Materials Science, Chemistry and Material Science (R0)