Euclidean chemical spaces from molecular fingerprints: Hamming distance and Hempel’s ravens

  • Special Series: Statistics in Molecular Modeling
  • Guest Editor: Anthony Nicholls
Journal of Computer-Aided Molecular Design

Abstract

Molecules are often characterized by sparse binary fingerprints, in which 1s represent the presence of substructures and 0s their absence. Fingerprints are especially useful for similarity calculations, such as database searching or clustering, where similarity is generally measured by the Tanimoto coefficient. In other cases, such as visualization, design of experiments, or latent variable regression, a low-dimensional Euclidean “chemical space” is more useful, where proximity between points reflects chemical similarity. It is tempting to apply principal components analysis (PCA) directly to these fingerprints to obtain a low-dimensional continuous chemical space. However, Gower has shown that distances from PCA on bit vectors are proportional to the square root of Hamming distance. Unlike Tanimoto similarity, Hamming similarity (HS) gives equal weight to shared 0s and shared 1s; that is, HS gives as much weight to substructures that neither molecule contains as to substructures that both molecules contain. Illustrative examples show that proximity in the corresponding chemical space mainly reflects similar size and complexity rather than shared chemical substructures. These spaces are ill-suited for visualizing and optimizing coverage of chemical space, or as latent variables for regression. A more suitable alternative is shown to be multidimensional scaling (MDS) on the Tanimoto distance matrix, which produces a space where proximity does reflect structural similarity.
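
The contrast drawn in the abstract can be illustrated with a minimal sketch, under stated assumptions. The three 16-bit fingerprints below are invented toy data, and the function names are hypothetical rather than taken from the article or from any cheminformatics library. PCA is computed with all components retained, so score distances equal the Euclidean distances between bit vectors, which are the square root of Hamming distance; truncating to a few components only approximates this. Classical (Torgerson) MDS is used as a simple stand-in and is not necessarily the MDS variant the authors applied.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of bit positions where the two fingerprints differ."""
    return int(np.sum(a != b))

def hamming_similarity(a, b):
    """Fraction of positions that agree; shared 0s count as much as shared 1s."""
    return float(np.mean(a == b))

def tanimoto_similarity(a, b):
    """Shared 1s divided by bits set in either fingerprint; shared 0s are ignored."""
    both = np.sum(a & b)
    either = np.sum(a | b)
    return float(both / either) if either else 1.0

def pca_scores(X):
    """All principal-component scores of X (rows = molecules), via SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U * s

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed a distance matrix D in k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:k]          # k largest eigenvalues
    return evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))

# Hypothetical toy fingerprints (not from the article).
n_bits = 16
a = np.zeros(n_bits, dtype=int); a[[0, 1]] = 1                      # small molecule
b = np.zeros(n_bits, dtype=int); b[[2, 3]] = 1                      # small, shares nothing with a
c = np.zeros(n_bits, dtype=int); c[[0, 1, 4, 5, 6, 7, 8, 9]] = 1    # larger, contains a's bits
fps = [a, b, c]

# Hamming similarity favors the equally small but unrelated molecule b;
# Tanimoto similarity favors c, which actually shares substructures with a.
print("HS(a,b), HS(a,c):", hamming_similarity(a, b), hamming_similarity(a, c))
print("T(a,b),  T(a,c): ", tanimoto_similarity(a, b), tanimoto_similarity(a, c))

# Distances between full PCA scores equal sqrt(Hamming distance).
scores = pca_scores(np.stack(fps).astype(float))
pca_ab = np.linalg.norm(scores[0] - scores[1])
print("PCA distance a-b vs sqrt(Hamming):", pca_ab, np.sqrt(hamming_distance(a, b)))

# MDS on the Tanimoto distance matrix (distance = 1 - similarity).
D_tan = np.array([[1.0 - tanimoto_similarity(x, y) for y in fps] for x in fps])
coords = classical_mds(D_tan, k=2)
mds_ab = np.linalg.norm(coords[0] - coords[1])
mds_ac = np.linalg.norm(coords[0] - coords[2])
print("MDS distance a-b, a-c:", mds_ab, mds_ac)
```

In this toy case, Hamming similarity ranks b (0.75) above c (0.625) as a neighbor of a, even though a and b share no set bits, while Tanimoto similarity ranks c (0.25) above b (0). The MDS coordinates built from Tanimoto distances place a closer to c than to b, whereas the PCA scores place a closer to b.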

References

  1. Distances similarity measures for binary data. http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fcmd_proximities_sim_measure_binary.htm. Accessed 17 June 2014

  2. Johnston JW (2014) Similarity indices I: what do they measure. Battelle Pacific Northwest Labs., Richland, Washington. http://www.iaea.org/inis/collection/NCLCollectionStore/_Public/08/337/8337829.pdf. Accessed 17 July 2014

  3. Choi S-S, Cha S-H, Tappert CC (2014) A survey of binary similarity and distance measures. Department of Computer Science, Pace University, New York. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.352.6123&rep=rep1&type=pdf. Accessed 25 July 2014

  4. Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132(3434):1115–1118

  5. Martin EJ, Blaney JM, Siani MA, Spellmeyer DC, Wong AK, Moos WH (1995) Measuring diversity: experimental design of combinatorial libraries for drug discovery. J Med Chem 38(9):1431–1436. doi:10.1021/jm00009a003

  6. Martin EJ, Critchlow RE (1999) Beyond mere diversity: tailoring combinatorial libraries for drug discovery. J Comb Chem 1(1):32–45. doi:10.1021/CC9800024

  7. Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53(3–4):325–338

  8. Young FW (1985) Multidimensional scaling. John Wiley & Sons, Inc. http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html. Accessed 15 July 2014

  9. Todeschini R, Consonni V, Xiang H, Holliday J, Buscema M, Willett P (2012) Similarity coefficients for binary chemoinformatics data: overview and extended comparison using simulated and real data sets. J Chem Inf Model 52(11):2884–2901

  10. Lounkine E, Kutchukian P, Petrone P, Davies JW, Glick M (2012) Chemotography for multi-target SAR analysis in the context of biological pathways. Bioorg Med Chem 20(18):5416–5427. doi:10.1016/j.bmc.2012.02.034

  11. Dillon WR, Goldstein M (1984) Multivariate analysis: methods and applications. Probability and mathematical statistics. Wiley, New York

  12. Rassokhin DN, Agrafiotis DK (2003) A modified update rule for stochastic proximity embedding. J Mol Graph Model 22(2):133–140. doi:10.1016/S1093-3263(03)00155-4

  13. Agrafiotis DK, Rassokhin DN, Lobanov VS (2001) Multidimensional scaling and visualization of large molecular similarity tables. J Comput Chem 22(5):488–500. doi:10.1002/1096-987X(20010415)22:5<488:AID-JCC1020>3.0.CO;2-4

  14. Clark RD, Patterson DE, Soltanshahi F, Blake JF, Matthew JB (2000) Visualizing substructural fingerprints. J Mol Graph Model 18(4–5):432–527

  15. Demartines P, Herault J (1997) Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets. IEEE Trans Neural Netw 8(1):148–154

  16. Landrum G RDKit: open-source cheminformatics. http://www.rdkit.org. Accessed 25 July 2014

  17. World Drug Index (2013). Thomson Reuters, New York

  18. Hempel C (1945) Studies in the logic of confirmation (I.). Mind 54(213):1–26

  19. Hempel C (1945) Studies in the logic of confirmation (II.). Mind 54(214):97–121

Author information

Corresponding author

Correspondence to Eric Martin.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 65 kb)

About this article

Cite this article

Martin, E., Cao, E. Euclidean chemical spaces from molecular fingerprints: Hamming distance and Hempel’s ravens. J Comput Aided Mol Des 29, 387–395 (2015). https://doi.org/10.1007/s10822-014-9819-y

  • DOI: https://doi.org/10.1007/s10822-014-9819-y
