Abstract
Molecules are often characterized by sparse binary fingerprints, where 1s represent the presence of substructures and 0s represent their absence. Fingerprints are especially useful for similarity calculations, such as database searching or clustering, where similarity is generally measured as the Tanimoto coefficient. In other cases, such as visualization, design of experiments, or latent variable regression, a low-dimensional Euclidean “chemical space” is more useful, where proximity between points reflects chemical similarity. It is tempting to apply principal components analysis (PCA) directly to these fingerprints to obtain a low-dimensional continuous chemical space. However, Gower has shown that distances from PCA on bit vectors are proportional to the square root of Hamming distance. Unlike Tanimoto similarity, Hamming similarity (HS) gives equal weight to shared 0s and shared 1s; that is, HS gives as much weight to substructures that neither molecule contains as to substructures that both molecules contain. Because fingerprints are sparse, shared 0s dominate the comparison. Illustrative examples show that proximity in the corresponding chemical space reflects mainly similar size and complexity rather than shared chemical substructures. These spaces are ill-suited for visualizing and optimizing coverage of chemical space, or as latent variables for regression. A more suitable alternative is shown to be multidimensional scaling (MDS) on the Tanimoto distance matrix, which produces a space where proximity does reflect structural similarity.
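The contrast between Hamming and Tanimoto similarity can be made concrete with a small sketch. The three fingerprints below are hypothetical toy bit vectors, not data from the paper, and the 20-bit length and helper names are assumptions chosen for illustration. Molecule A and molecule B are the same size but share no substructures, while molecule C is larger and contains every substructure of A.

```python
import math

N_BITS = 20  # hypothetical fingerprint length, chosen only for illustration


def to_bits(on_bits, n_bits=N_BITS):
    """Expand a set of 'on' bit positions into a 0/1 list."""
    return [1 if i in on_bits else 0 for i in range(n_bits)]


def hamming_distance(a, b):
    """Count the positions where the two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_similarity(a, b):
    """Fraction of positions that agree; shared 0s count as much as shared 1s."""
    return 1 - hamming_distance(a, b) / len(a)


def tanimoto_similarity(a, b):
    """Shared 1s divided by positions set in either fingerprint; shared 0s are ignored."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


def euclidean_distance(a, b):
    """Ordinary Euclidean distance between the bit vectors treated as points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy molecules (assumed for illustration):
mol_a = to_bits({0, 1, 2})       # small molecule
mol_b = to_bits({3, 4, 5})       # same size as A, no shared substructures
mol_c = to_bits(set(range(10)))  # larger molecule containing all of A's bits

for name, other in (("B", mol_b), ("C", mol_c)):
    print(
        f"A vs {name}: "
        f"Hamming similarity = {hamming_similarity(mol_a, other):.2f}, "
        f"Tanimoto = {tanimoto_similarity(mol_a, other):.2f}, "
        f"Euclidean^2 = {euclidean_distance(mol_a, other) ** 2:.1f}, "
        f"Hamming distance = {hamming_distance(mol_a, other)}"
    )
```

In this toy case Hamming similarity ranks B (0.70) as closer to A than C (0.65), even though A and B have nothing in common, whereas Tanimoto similarity ranks C (0.30) above B (0.00). The squared Euclidean distance between bit vectors equals the Hamming distance, which is why a distance-preserving projection such as PCA inherits the Hamming behaviour.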
References
Distances similarity measures for binary data. http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fcmd_proximities_sim_measure_binary.htm. Accessed 17 June 2014
Johnston JW (2014) Similarity indices I: what do they measure. Battelle Pacific Northwest Labs., Richland, Washington. http://www.iaea.org/inis/collection/NCLCollectionStore/_Public/08/337/8337829.pdf. Accessed 17 July 2014
Choi S-S, Cha S-H, Tappert CC (2014) A survey of binary similarity and distance measures. Department of Computer Science, Pace University, New York. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.352.6123&rep=rep1&type=pdf. Accessed 25 July 2014
Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132(3434):1115–1118
Martin EJ, Blaney JM, Siani MA, Spellmeyer DC, Wong AK, Moos WH (1995) Measuring diversity: experimental design of combinatorial libraries for drug discovery. J Med Chem 38(9):1431–1436. doi:10.1021/jm00009a003
Martin EJ, Critchlow RE (1999) Beyond mere diversity: tailoring combinatorial libraries for drug discovery. J Comb Chem 1(1):32–45. doi:10.1021/CC9800024
Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53(3–4):325–338
Young FW (1985) Multidimensional scaling. John Wiley & Sons, Inc. http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html. Accessed 15 July 2014
Todeschini R, Consonni V, Xiang H, Holliday J, Buscema M, Willett P (2012) Similarity coefficients for binary chemoinformatics data: overview and extended comparison using simulated and real data sets. J Chem Inf Model 52(11):2884–2901
Lounkine E, Kutchukian P, Petrone P, Davies JW, Glick M (2012) Chemotography for multi-target SAR analysis in the context of biological pathways. Bioorg Med Chem 20(18):5416–5427. doi:10.1016/j.bmc.2012.02.034
Dillon WR, Goldstein M (1984) Multivariate analysis: methods and applications. Probability and mathematical statistics. Wiley, New York
Rassokhin DN, Agrafiotis DK (2003) A modified update rule for stochastic proximity embedding. J Mol Graph Model 22(2):133–140. doi:10.1016/S1093-3263(03)00155-4
Agrafiotis DK, Rassokhin DN, Lobanov VS (2001) Multidimensional scaling and visualization of large molecular similarity tables. J Comput Chem 22(5):488–500. doi:10.1002/1096-987X(20010415)22:5<488:AID-JCC1020>3.0.CO;2-4
Clark RD, Patterson DE, Soltanshahi F, Blake JF, Matthew JB (2000) Visualizing substructural fingerprints. J Mol Graph Model 18(4–5):432–527
Demartines P, Herault J (1997) Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets. IEEE Trans Neural Netw 8(1):148–154
Landrum G RDKit: open-source cheminformatics. http://www.rdkit.org. Accessed 25 July 2014
World Drug Index (2013). Thomson Reuters, New York
Hempel C (1945) Studies in the logic of confirmation (I.). Mind 54(213):1–26
Hempel C (1945) Studies in the logic of confirmation (II.). Mind 54(214):97–121
Cite this article
Martin, E., Cao, E. Euclidean chemical spaces from molecular fingerprints: Hamming distance and Hempel’s ravens. J Comput Aided Mol Des 29, 387–395 (2015). https://doi.org/10.1007/s10822-014-9819-y