LINGO-DL: a text-based approach for molecular similarity searching

Abdo, Ammar; Pupin, Maude

doi:10.1007/s10822-021-00383-9

Published: 02 April 2021

Volume 35, pages 657–665 (2021)

Journal of Computer-Aided Molecular Design

Abstract

Line notations of chemical structures are more compact than graphs and connection tables, which makes them useful for storing and transferring large numbers of molecular structures. The simplified molecular input line entry system (SMILES) is the most widely used line notation: it is easier to use and read than the alternatives, and it can be generated automatically from connection tables. A SMILES string encodes the structure of a molecule. An existing method, LINGO, uses it to calculate molecular similarities and to predict structure-related properties. The LINGO method decomposes a canonical SMILES into a set of four-character substrings referred to as LINGOs, and measures the similarity between a pair of molecules by comparing the LINGOs that occur in each. This paper introduces a variant of the LINGO method, called LINGO-DL, that uses LINGOs of different lengths: it fragments a canonical SMILES into substrings of three different lengths rather than the single length used in LINGO. Retrospective virtual screening experiments with the MDDR, DUD, and MUV datasets show that LINGO-DL outperforms the LINGO method, especially when the active molecules being sought have a high degree of structural heterogeneity.
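
The fragmentation idea described in the abstract can be illustrated with a minimal Python sketch. Several choices here are assumptions and not taken from the paper: the abstract does not state which three lengths LINGO-DL uses, so `(3, 4, 5)` is a hypothetical default; pooling the substrings of all lengths into one multiset is one possible way to combine them; and the similarity is a plain multiset Tanimoto coefficient, which may differ from the scoring function used by the authors. The input strings are assumed to be canonical SMILES already, and no further preprocessing is applied.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count every overlapping substring of length q in a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def tanimoto(a: Counter, b: Counter) -> float:
    """Multiset Tanimoto: shared counts over the union of counts.

    Assumption: this coefficient stands in for the paper's scoring
    function, which the abstract does not specify.
    """
    shared = sum((a & b).values())  # per-substring minimum count
    total = sum((a | b).values())   # per-substring maximum count
    return shared / total if total else 0.0


def lingo_similarity(s1: str, s2: str, q: int = 4) -> float:
    """Single-length comparison, with q = 4 as described for LINGO."""
    return tanimoto(lingos(s1, q), lingos(s2, q))


def lingo_dl_similarity(s1: str, s2: str, lengths=(3, 4, 5)) -> float:
    """Multi-length comparison in the spirit of LINGO-DL.

    Assumption: the lengths (3, 4, 5) and the pooling of all
    substrings into one multiset are illustrative choices only.
    """
    a, b = Counter(), Counter()
    for q in lengths:
        a.update(lingos(s1, q))
        b.update(lingos(s2, q))
    return tanimoto(a, b)


if __name__ == "__main__":
    butanol, butylamine = "CCCCO", "CCCCN"
    print(lingo_similarity(butanol, butylamine))
    print(lingo_dl_similarity(butanol, butylamine))
```

Shorter substrings are more likely to be shared between structurally different molecules, so mixing lengths changes how strongly partial matches contribute to the score; whether that helps in practice is what the retrospective screening experiments evaluate.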

Funding

This work was supported by Lille University, CNRS and Programme national d’aide à l’Accueil en Urgence des Scientifiques en Exil (PAUSE).

Author information

Authors and Affiliations

Universite de Lille, Villeneuve d’Ascq cedex, France
Ammar Abdo & Maude Pupin

Contributions

The research was conducted by mutual contributions of all authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ammar Abdo.

About this article

Cite this article

Abdo, A., Pupin, M. LINGO-DL: a text-based approach for molecular similarity searching. J Comput Aided Mol Des 35, 657–665 (2021). https://doi.org/10.1007/s10822-021-00383-9

Received: 18 December 2020
Accepted: 26 March 2021
Published: 02 April 2021
Issue Date: May 2021
DOI: https://doi.org/10.1007/s10822-021-00383-9
