Weighted edit distance optimized using genetic algorithm for SMILES-based compound similarity

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

One method for developing new drugs is the ligand-based approach, which requires computing intermolecular similarity. The simplified molecular input line entry system (SMILES) is primarily used to represent molecular structure in one dimension. Because a SMILES string encodes molecular structure, changing even a single character can yield completely different properties. The conventional edit distance therefore struggles to produce optimal results, because it treats the insertion, deletion, and substitution operations as equally costly when calculating the distance. This study proposes a novel edit distance that uses an optimal weight set for the three operations. To determine the optimal weight set, we present a genetic algorithm with suitable hyperparameters. To demonstrate the impact of the proposed genetic algorithm, we compare it with an exhaustive search algorithm. Experiments on four well-known datasets showed that the weighted edit distance optimized with the genetic algorithm yielded an average performance improvement of approximately 20%.
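To make the idea of a weight set for the three operations concrete, the following is a minimal sketch of a weighted edit distance between two SMILES strings. It is an illustration under stated assumptions, not the authors' implementation: the function name `weighted_edit_distance` and the parameters `w_ins`, `w_del`, and `w_sub` are hypothetical, the weights shown are arbitrary rather than optimized values, and the strings are compared character by character rather than by chemically meaningful tokens.

```python
def weighted_edit_distance(a: str, b: str,
                           w_ins: float = 1.0,
                           w_del: float = 1.0,
                           w_sub: float = 1.0) -> float:
    """Minimum total cost of turning string a into string b.

    Hypothetical helper: each insertion, deletion, and substitution
    carries its own weight. With all weights equal to 1 this reduces
    to the conventional (Levenshtein) edit distance.
    """
    # prev[j] = cost of converting the processed prefix of a into b[:j].
    # Starting from an empty prefix, only insertions are possible.
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Converting a[:i] into the empty string needs i deletions.
        curr = [i * w_del]
        for j, cb in enumerate(b, start=1):
            keep_or_substitute = prev[j - 1] + (0.0 if ca == cb else w_sub)
            delete = prev[j] + w_del
            insert = curr[j - 1] + w_ins
            curr.append(min(keep_or_substitute, delete, insert))
        prev = curr
    return prev[-1]


# Ethanol vs. acetic acid: four characters, "(=O)", must be inserted.
print(weighted_edit_distance("CCO", "CC(=O)O"))             # 4.0
# Halving the insertion weight halves the cost of the same alignment.
print(weighted_edit_distance("CCO", "CC(=O)O", w_ins=0.5))  # 2.0
```

In this sketch, a search procedure such as a genetic algorithm would treat the triple (`w_ins`, `w_del`, `w_sub`) as the candidate solution and score it by how well the resulting distances perform on a downstream task; that scoring function is not specified in the abstract and is left out here.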

Data availability statement

The datasets generated and/or analyzed during the current study are available in the repository at https://github.com/Sabro98/GA-WeightedEditSimilarity/tree/master/data.

Author information

Corresponding author

Correspondence to Il-Seok Oh.

Ethics declarations

Conflict of interest

The authors did not receive support from any organization for the submitted work. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.

About this article

Cite this article

Choi, IH., Oh, IS. Weighted edit distance optimized using genetic algorithm for SMILES-based compound similarity. Pattern Anal Applic 26, 1161–1170 (2023). https://doi.org/10.1007/s10044-023-01141-3

  • DOI: https://doi.org/10.1007/s10044-023-01141-3
