LINGO-DL: a text-based approach for molecular similarity searching

Journal of Computer-Aided Molecular Design

Abstract

Line notations of chemical structures are more compact than graphs and connection tables, which makes them useful for storing and transferring large numbers of molecular structures. The simplified molecular-input line-entry system (SMILES) is the most widely used line notation: it is easier to use and understand than the alternatives, and it can be generated automatically from connection tables. A SMILES string encodes the structure of a molecule. An existing method, LINGO, uses SMILES to calculate molecular similarities and to predict structure-related properties. LINGO decomposes a canonical SMILES string into a set of four-character substrings, referred to as LINGOs, and measures the similarity between a pair of molecules by comparing the LINGOs that occur in each. This paper introduces LINGO-DL, an alternative version of the LINGO method that uses LINGOs of different lengths: the canonical SMILES is fragmented into substrings of three different lengths rather than the single length used in LINGO. Retrospective virtual screening experiments on the MDDR, DUD, and MUV datasets show that LINGO-DL outperforms the LINGO method, especially when the active molecules being sought have a high degree of structural heterogeneity.
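
The decomposition described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: substrings are taken with a sliding window (overlapping), the similarity is a count-based Tanimoto coefficient, and the multi-length variant simply pools substrings of all lengths into one multiset. The function names and the lengths used in the tests are hypothetical, and the sketch omits any SMILES canonicalisation or preprocessing.

```python
from collections import Counter
from typing import Iterable


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count every overlapping substring of length q in a SMILES string.

    Assumption: a sliding window with step 1. Strings shorter than q
    yield an empty multiset.
    """
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def multi_length_lingos(smiles: str, lengths: Iterable[int]) -> Counter:
    """Pool the substrings of several lengths into one multiset.

    Assumption: pooling is only one possible way to combine lengths;
    the lengths themselves are supplied by the caller.
    """
    pooled = Counter()
    for q in lengths:
        pooled.update(lingos(smiles, q))
    return pooled


def tanimoto(a: Counter, b: Counter) -> float:
    """Count-based Tanimoto: sum of minimum counts over sum of maximum counts."""
    shared = sum((a & b).values())  # Counter & keeps the minimum count per key
    total = sum((a | b).values())   # Counter | keeps the maximum count per key
    return shared / total if total else 0.0
```

Calling `tanimoto(lingos(s1), lingos(s2))` on two canonical SMILES strings gives a single-length similarity, while `tanimoto(multi_length_lingos(s1, lengths), multi_length_lingos(s2, lengths))` gives a multi-length one in the spirit of LINGO-DL. Because the comparison is purely textual, both strings should come from the same canonicalisation procedure.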

Funding

This work was supported by Lille University, CNRS and Programme national d’aide à l’Accueil en Urgence des Scientifiques en Exil (PAUSE).

Author information

Contributions

The research was conducted through the joint contributions of all authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ammar Abdo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Abdo, A., Pupin, M. LINGO-DL: a text-based approach for molecular similarity searching. J Comput Aided Mol Des 35, 657–665 (2021). https://doi.org/10.1007/s10822-021-00383-9

  • DOI: https://doi.org/10.1007/s10822-021-00383-9
