Extended continuous similarity indices: theory and application for QSAR descriptor selection

Rácz, Anita; Dunn, Timothy B.; Bajusz, Dávid; Kim, Taewon D.; Miranda-Quintana, Ramón Alain; Héberger, Károly

doi:10.1007/s10822-022-00444-7

Published: 15 March 2022

Volume 36, pages 157–173 (2022)

Journal of Computer-Aided Molecular Design

Anita Rácz¹,
Timothy B. Dunn²,
Dávid Bajusz³,
Taewon D. Kim²,
Ramón Alain Miranda-Quintana²,⁴ (ORCID: orcid.org/0000-0003-2121-4449) &
Károly Héberger¹

Abstract

Extended (or n-ary) similarity indices have recently been proposed to extend the comparative analysis of binary strings. Going beyond the traditional notion of pairwise comparisons, these novel indices allow any number of objects to be compared at the same time. This yields a remarkable efficiency gain over other approaches, since N molecules can now be compared in O(N) time instead of the usual quadratic O(N²). This favorable scaling has motivated the application of these indices to diversity selection, clustering, phylogenetic analysis, chemical space visualization, and post-processing of molecular dynamics simulations. However, the current formulation of the n-ary indices is limited to vectors with binary or categorical inputs. Here, we present a further generalization of this formalism so that it can be applied to numerical data, i.e., to vectors with continuous components. We discuss several ways to achieve this extension and present their analytical properties. As a practical example, we apply this formalism to the problem of feature selection in QSAR and show that the extended continuous similarity indices provide a convenient way to discern between several sets of descriptors.
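
To make the O(N) claim concrete, the following is a minimal Python sketch of how an n-ary index can be computed from column sums alone, so that no pair of objects is ever compared directly. It is an illustration under stated assumptions, not the authors' implementation: the function names (minmax_scale, extended_jt), the min-max scaling to [0, 1], the linear weight functions, and the threshold parameter are all choices made here for illustration, and the paper itself discusses several alternative ways to handle continuous data.

```python
import numpy as np


def minmax_scale(X):
    """Scale each column (descriptor) to [0, 1]; constant columns become 0.

    Assumption: min-max scaling is only one possible normalization.
    """
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (X - lo) / span


def extended_jt(X, threshold=0.0):
    """Illustrative n-ary Jaccard-Tanimoto-like index over the rows of X.

    X holds N objects (rows) with entries in [0, 1]. Only the column sums
    are needed, so the cost grows linearly with N.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sigma = X.sum(axis=0)            # single pass over the data
    delta = np.abs(2.0 * sigma - n)  # 0: evenly split column, n: full agreement

    one_sim = (2.0 * sigma - n) > threshold  # column dominated by high values
    dissim = delta <= threshold              # column without a clear majority

    w_sim = delta / n        # assumed weight for similarity counters
    w_dis = 1.0 - delta / n  # assumed weight for dissimilarity counters

    num = w_sim[one_sim].sum()
    den = num + w_dis[dissim].sum()
    return num / den if den > 0 else 0.0


# Hypothetical descriptor matrix: 1000 molecules, 50 continuous descriptors.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(1000, 50))
print(extended_jt(minmax_scale(descriptors), threshold=100))
```

For two binary vectors and a threshold of zero, this sketch reduces to the familiar pairwise Tanimoto coefficient, which is a quick sanity check on the column-sum formulation.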

Data availability

Data is available from the authors upon reasonable request. Python code for calculating the extended similarity measures is freely available at: https://github.com/ramirandaq/MultipleComparisons

Funding

National Research, Development and Innovation Office of Hungary (OTKA), contracts K_20 134260 (KH) and PD_20 134416 (AR). University of Florida, startup grant (RAMQ). Hungarian Academy of Sciences, János Bolyai Research Scholarship (DB). Ministry for Innovation and Technology of Hungary, ÚNKP-21-5 New National Excellence Program (DB).

Author information

Anita Rácz and Timothy B. Dunn have contributed equally to this work.

Authors and Affiliations

Plasma Chemistry Research Group, Research Centre for Natural Sciences, Magyar Tudósok Krt. 2, 1117, Budapest, Hungary
Anita Rácz & Károly Héberger
Department of Chemistry, University of Florida, Gainesville, FL, 32611, USA
Timothy B. Dunn, Taewon D. Kim & Ramón Alain Miranda-Quintana
Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar Tudósok Krt. 2, 1117, Budapest, Hungary
Dávid Bajusz
Quantum Theory Project, University of Florida, Gainesville, FL, 32611, USA
Ramón Alain Miranda-Quintana

Corresponding authors

Correspondence to Ramón Alain Miranda-Quintana or Károly Héberger.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 4056 kb)

About this article

Cite this article

Rácz, A., Dunn, T.B., Bajusz, D. et al. Extended continuous similarity indices: theory and application for QSAR descriptor selection. J Comput Aided Mol Des 36, 157–173 (2022). https://doi.org/10.1007/s10822-022-00444-7

Received: 11 November 2021
Accepted: 23 February 2022
Published: 15 March 2022
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10822-022-00444-7
