Advances in the Application of Machine Learning Techniques in Drug Discovery, Design and Development

  • S. J. Barrett
  • W. B. Langdon
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 36)


Machine learning tools, in particular Support Vector Machines (SVM), Particle Swarm Optimisation (PSO) and Genetic Programming (GP), are increasingly used in pharmaceutical research and development. They are inherently suited to ‘noisy’, high-dimensional (many-variable) data of the kind commonly encountered in cheminformatic (e.g. in silico screening), bioinformatic (e.g. biomarker studies using DNA chip data) and other types of drug research studies. These aspects are demonstrated through a review of their current usage and future prospects in the context of drug discovery activities.


Support Vector Machine; Particle Swarm Optimiza; Feature Selection; Particle Swarm; Drug Discovery





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • S. J. Barrett (1)
  • W. B. Langdon (2)
  1. Analysis Applications, Research and Technologies, GlaxoSmithKline R&D, Greenford, Middlesex, UK
  2. Computer Science, University of Essex, Colchester, UK
