Predictive Cheminformatics in Drug Discovery: Statistical Modeling for Analysis of Micro-array and Gene Expression Data

  • N. Sukumar
  • Michael P. Krein
  • Mark J. Embrechts
Part of the Methods in Molecular Biology book series (MIMB, volume 910)


The vast amounts of chemical and biological data available through robotic high-throughput assays and micro-array technologies require computational techniques for visualization, analysis, and predictive modeling. Predictive cheminformatics and bioinformatics employ statistical methods to mine these data for hidden correlations and to retrieve molecules or genes with desirable biological activity from large databases, for the purpose of drug development. While many statistical methods are commonly employed and widely accessible, their proper use requires due consideration of data representation and preprocessing, model validation and domain of applicability estimation, similarity assessment, the nature of the structure-activity landscape, and model interpretation. This chapter reviews these considerations in light of the current state of the art in statistical modeling and summarizes best practices in predictive cheminformatics.

Key words

Cheminformatics, Bioinformatics, QSAR, Molecular modeling, Molecular similarity, Micro-array, Data mining, High-throughput screening



Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • N. Sukumar (1)
  • Michael P. Krein (1)
  • Mark J. Embrechts (2)

  1. Rensselaer Exploratory Center for Cheminformatics Research and Department of Chemistry and Chemical Biology, Rensselaer Polytechnic Institute, Troy, USA
  2. Department of Industrial and Systems Engineering, Rensselaer Polytechnic Institute, Troy, USA
