Weighted edit distance optimized using genetic algorithm for SMILES-based compound similarity

Choi, In-Hyuk; Oh, Il-Seok

doi:10.1007/s10044-023-01141-3

Theoretical Advances
Published: 18 February 2023

Volume 26, pages 1161–1170 (2023)
Pattern Analysis and Applications

Abstract

One approach to developing new drugs is the ligand-based approach, which requires computing intermolecular similarity. The simplified molecular input line entry system (SMILES) is primarily used to represent molecular structure as a one-dimensional string. Because the string encodes the structure, changing even a single character can produce a compound with completely different properties. The conventional edit distance therefore struggles to give optimal results, because it assigns the same cost to the insertion, deletion, and substitution of characters when calculating the distance. This study proposes a novel edit distance that uses an optimal weight set for the three operations. To determine the optimal weight set, we present a genetic algorithm with suitable hyperparameters. To assess the impact of the proposed genetic algorithm, we compare it with an exhaustive search algorithm. Experiments on four well-known datasets showed that the weighted edit distance optimized with the genetic algorithm yielded an average performance improvement of approximately 20%.
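
The weighted edit distance described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `weighted_edit_distance`, the parameter names `w_ins`, `w_del`, and `w_sub`, and the character-level treatment of SMILES strings are assumptions made for illustration. In the paper the three weights are found by a genetic algorithm; here they are simply passed in by the caller.

```python
def weighted_edit_distance(a: str, b: str,
                           w_ins: float = 1.0,
                           w_del: float = 1.0,
                           w_sub: float = 1.0) -> float:
    """Minimum total cost of turning string a into string b.

    Hypothetical sketch: each insertion, deletion, and substitution
    has its own weight. With all weights equal to 1 this reduces to
    the conventional (Levenshtein) edit distance.
    """
    # prev[j] is the cost of turning the previous prefix of a into b[:j];
    # for the empty prefix of a, that is j insertions.
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i * w_del] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            sub_cost = 0.0 if a[i - 1] == b[j - 1] else w_sub
            curr[j] = min(
                prev[j] + w_del,         # delete a[i-1]
                curr[j - 1] + w_ins,     # insert b[j-1]
                prev[j - 1] + sub_cost,  # match or substitute
            )
        prev = curr
    return prev[len(b)]


# Ethanol ("CCO") versus ethylamine ("CCN"): one substitution.
print(weighted_edit_distance("CCO", "CCN"))
# With an expensive substitution, delete-then-insert becomes cheaper.
print(weighted_edit_distance("CCO", "CCN", w_sub=3.0))
```

A weight search, whether genetic or exhaustive, would call such a function repeatedly with candidate `(w_ins, w_del, w_sub)` triples and keep the triple that scores best on the chosen evaluation measure. Note that this sketch compares single characters, so multi-character SMILES atoms such as `Cl` or `Br` are treated as two symbols.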

Data availability statement

The datasets generated and/or analyzed during the current study are available in the repository at https://github.com/Sabro98/GA-WeightedEditSimilarity/tree/master/data.

Author information

Authors and Affiliations

Division of Computer Science and Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
In-Hyuk Choi & Il-Seok Oh
Center for Advanced Image Information Technology, Jeonju, 54896, South Korea
Il-Seok Oh

Corresponding author

Correspondence to Il-Seok Oh.

Ethics declarations

Conflict of interest

The authors did not receive support from any organization for the submitted work. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Choi, IH., Oh, IS. Weighted edit distance optimized using genetic algorithm for SMILES-based compound similarity. Pattern Anal Applic 26, 1161–1170 (2023). https://doi.org/10.1007/s10044-023-01141-3

Received: 13 April 2022
Accepted: 24 January 2023
Published: 18 February 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s10044-023-01141-3
