Abstract
Molecular descriptors encode a wide variety of molecular information and have become the foundation of many contemporary chemoinformatic and bioinformatic applications. They capture specific molecular features (e.g., geometry, shape, pharmacophores, or atomic properties) and directly affect the outcome, performance, and applicability of computational models. This chapter illustrates how the choice of molecular descriptors affects the structural information captured and the perceived chemical similarity among molecules. After an introduction to the fundamental concepts of molecular descriptor theory and application, a step-by-step retrospective virtual screening procedure guides readers through the main processing steps and discusses the impact of different types of molecular descriptors.
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Grisoni, F., Consonni, V., Todeschini, R. (2018). Impact of Molecular Descriptors on Computational Models. In: Brown, J. (ed.) Computational Chemogenomics. Methods in Molecular Biology, vol 1825. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8639-2_5
DOI: https://doi.org/10.1007/978-1-4939-8639-2_5
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8638-5
Online ISBN: 978-1-4939-8639-2
eBook Packages: Springer Protocols