Euclidean chemical spaces from molecular fingerprints: Hamming distance and Hempel’s ravens

Martin, Eric; Cao, Eddie

doi:10.1007/s10822-014-9819-y


Special Series: Statistics in Molecular Modeling
Guest Editor: Anthony Nicholls
Published: 05 December 2014

Journal of Computer-Aided Molecular Design, volume 29, pages 387–395 (2015)

Eric Martin¹ &
Eddie Cao²

Abstract

Molecules are often characterized by sparse binary fingerprints, where 1s represent the presence of substructures and 0s represent their absence. Fingerprints are especially useful for similarity calculations, such as database searching or clustering, where similarity is generally measured by the Tanimoto coefficient. In other cases, such as visualization, design of experiments, or latent variable regression, a low-dimensional Euclidean “chemical space” is more useful, where proximity between points reflects chemical similarity. It is tempting to apply principal components analysis (PCA) directly to these fingerprints to obtain a low-dimensional continuous chemical space. However, Gower has shown that distances from PCA on bit vectors are proportional to the square root of Hamming distance. Unlike Tanimoto similarity, Hamming similarity (HS) gives shared 0s the same weight as shared 1s; that is, HS gives as much weight to substructures that neither molecule contains as to substructures that both molecules contain. Illustrative examples show that proximity in the corresponding chemical space mainly reflects similar size and complexity rather than shared chemical substructures. These spaces are ill-suited for visualizing and optimizing coverage of chemical space, or as latent variables for regression. A more suitable alternative is shown to be multidimensional scaling (MDS) on the Tanimoto distance matrix, which produces a space where proximity does reflect structural similarity.
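The contrast drawn in the abstract can be illustrated with a small numerical sketch. It is not the authors' code: the 1024-bit length, the three toy fingerprints, and all function names are hypothetical, Tanimoto distance is assumed to be 1 minus Tanimoto similarity, and classical (Torgerson) MDS stands in for whichever MDS variant the paper uses. The sketch also checks the elementary fact behind the Hamming result: the squared Euclidean distance between two bit vectors equals the number of bits on which they differ.

```python
import numpy as np

N_BITS = 1024  # hypothetical fingerprint length


def fingerprint(on_bits, n_bits=N_BITS):
    """Build a boolean fingerprint with the given bit positions set to 1."""
    fp = np.zeros(n_bits, dtype=bool)
    fp[list(on_bits)] = True
    return fp


def hamming_similarity(a, b):
    """Fraction of positions where the bits agree (shared 1s and shared 0s)."""
    return float(np.mean(a == b))


def tanimoto_similarity(a, b):
    """Shared 1s divided by bits set in either fingerprint; shared 0s are ignored."""
    union = np.count_nonzero(a | b)
    return np.count_nonzero(a & b) / union if union else 1.0


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS: eigendecompose the double-centred squared distances."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ d2 @ centring
    vals, vecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:n_components]
    scale = np.sqrt(np.clip(vals[top], 0.0, None))
    return vecs[:, top] * scale


# Toy molecules (hypothetical): a and b are small and share no substructures;
# c is larger and contains every substructure of a.
mol_a = fingerprint(range(0, 10))
mol_b = fingerprint(range(100, 110))
mol_c = fingerprint(range(0, 40))

# Squared Euclidean distance between bit vectors equals the Hamming count.
sq_euclid = np.sum((mol_a.astype(float) - mol_c.astype(float)) ** 2)
n_diff = np.count_nonzero(mol_a != mol_c)

# Assumed Tanimoto distance: 1 - Tanimoto similarity.
fps = [mol_a, mol_b, mol_c]
tani_dist = np.array([[1.0 - tanimoto_similarity(x, y) for y in fps] for x in fps])
coords = classical_mds(tani_dist, n_components=2)

print("HS(a,b) =", hamming_similarity(mol_a, mol_b),
      " HS(a,c) =", hamming_similarity(mol_a, mol_c))
print("T(a,b)  =", tanimoto_similarity(mol_a, mol_b),
      " T(a,c)  =", tanimoto_similarity(mol_a, mol_c))
```

In this toy case Hamming similarity rates a and b, which share no set bits, as more alike than a and c, simply because a and b are both small; Tanimoto similarity ranks them the other way, and the MDS coordinates built from the Tanimoto distances place a closer to c than to b.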


References

Distances similarity measures for binary data. http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fcmd_proximities_sim_measure_binary.htm. Accessed 17 June 2014
Johnston JW (2014) Similarity indices I: what do they measure. Battelle Pacific Northwest Labs., Richland, Washington. http://www.iaea.org/inis/collection/NCLCollectionStore/_Public/08/337/8337829.pdf. Accessed 17 July 2014
Choi S-S, Cha S-H, Tappert CC (2014) A survey of binary similarity and distance measures. Department of Computer Science, Pace University, New York. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.352.6123&rep=rep1&type=pdf. Accessed 25 July 2014
Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132(3434):1115–1118
Martin EJ, Blaney JM, Siani MA, Spellmeyer DC, Wong AK, Moos WH (1995) Measuring diversity: experimental design of combinatorial libraries for drug discovery. J Med Chem 38(9):1431–1436. doi:10.1021/jm00009a003
Martin EJ, Critchlow RE (1999) Beyond mere diversity: tailoring combinatorial libraries for drug discovery. J Comb Chem 1(1):32–45. doi:10.1021/CC9800024
Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53(3–4):325–338
Young FW (1985) Multidimensional scaling. John Wiley & Sons, Inc. http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html. Accessed 15 July 2014
Todeschini R, Consonni V, Xiang H, Holliday J, Buscema M, Willett P (2012) Similarity coefficients for binary chemoinformatics data: overview and extended comparison using simulated and real data sets. J Chem Inf Model 52(11):2884–2901
Lounkine E, Kutchukian P, Petrone P, Davies JW, Glick M (2012) Chemotography for multi-target SAR analysis in the context of biological pathways. Bioorg Med Chem 20(18):5416–5427. doi:10.1016/j.bmc.2012.02.034
Dillon WR, Goldstein M (1984) Multivariate analysis: methods and applications. Probability and mathematical statistics. Wiley, New York
Rassokhin DN, Agrafiotis DK (2003) A modified update rule for stochastic proximity embedding. J Mol Graph Model 22(2):133–140. doi:10.1016/S1093-3263(03)00155-4
Agrafiotis DK, Rassokhin DN, Lobanov VS (2001) Multidimensional scaling and visualization of large molecular similarity tables. J Comput Chem 22(5):488–500. doi:10.1002/1096-987X(20010415)22:5<488:AID-JCC1020>3.0.CO;2-4
Clark RD, Patterson DE, Soltanshahi F, Blake JF, Matthew JB (2000) Visualizing substructural fingerprints. J Mol Graph Model 18(4–5):432–527
Demartines P, Herault J (1997) Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets. IEEE Trans Neural Netw 8(1):148–154
Landrum G RDKit: open-source cheminformatics. http://www.rdkit.org. Accessed 25 July 2014
World Drug Index (2013). Thomson Reuters, New York
Hempel C (1945) Studies in the logic of confirmation (I.). Mind 54(213):1–26
Hempel C (1945) Studies in the logic of confirmation (II.). Mind 54(214):97–121


Author information

Authors and Affiliations

Global Discovery Chemistry, Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, CA, 94608-2916, USA
Eric Martin
Counsyl, Inc., 180 Kimball Way, South San Francisco, CA, 94080, USA
Eddie Cao


Corresponding author

Correspondence to Eric Martin.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 65 kb)


About this article

Cite this article

Martin, E., Cao, E. Euclidean chemical spaces from molecular fingerprints: Hamming distance and Hempel’s ravens. J Comput Aided Mol Des 29, 387–395 (2015). https://doi.org/10.1007/s10822-014-9819-y


Received: 25 July 2014
Accepted: 24 November 2014
Published: 05 December 2014
Issue Date: May 2015
DOI: https://doi.org/10.1007/s10822-014-9819-y
