Abstract
Line notations of chemical structures are more compact than graphs and connection tables, which makes them useful for storing and transferring large numbers of molecular structures. The simplified molecular input line entry system (SMILES) is the most widely used line notation because it is easier to use and understand than the alternatives and can be generated automatically from connection tables. A SMILES string encodes the structure of a molecule. An existing method, LINGO, uses SMILES to calculate molecular similarities and predict structure-related properties. The LINGO method decomposes a canonical SMILES into a set of four-character substrings referred to as LINGOs, and measures the similarity between a pair of molecules by comparing the LINGOs that occur in each. This paper introduces an alternative version of the LINGO method, called LINGO-DL, that uses LINGOs of different lengths. LINGO-DL fragments a canonical SMILES into substrings of three different lengths rather than the single length used in the LINGO method. Retrospective virtual screening experiments with the MDDR, DUD, and MUV datasets show that LINGO-DL outperforms the LINGO method, especially when the active molecules being sought have a high degree of structural heterogeneity.
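The decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`lingos`, `lingos_multi`, `tanimoto`) are hypothetical, the lengths `(3, 4, 5)` are an assumed example because the abstract does not state which three lengths LINGO-DL uses, pooling all lengths into one multiset is likewise an assumption, and the count-based Tanimoto coefficient is a common choice that may differ from the formula used in the paper. The sketch also skips any normalisation of the SMILES string before fragmentation.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count every overlapping substring of length q in a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingos_multi(smiles: str, lengths=(3, 4, 5)) -> Counter:
    """Pool substrings of several lengths into one multiset.

    The lengths and the pooling strategy are assumptions for illustration;
    the abstract only says that three different lengths are used.
    """
    counts = Counter()
    for q in lengths:
        counts.update(lingos(smiles, q))
    return counts


def tanimoto(a: Counter, b: Counter) -> float:
    """Count-based Tanimoto: sum of minimum counts over sum of maximum counts."""
    shared = sum((a & b).values())  # & keeps the minimum count per substring
    total = sum((a | b).values())   # | keeps the maximum count per substring
    return shared / total if total else 0.0


if __name__ == "__main__":
    benzene, toluene = "c1ccccc1", "Cc1ccccc1"
    print(lingos(benzene))
    print(tanimoto(lingos(benzene), lingos(toluene)))
    print(tanimoto(lingos_multi(benzene), lingos_multi(toluene)))
```

Because the fragments are plain substrings, the comparison needs only string handling and no graph processing of the molecule.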
Funding
This work was supported by Lille University, CNRS and Programme national d’aide à l’Accueil en Urgence des Scientifiques en Exil (PAUSE).
Author information
Contributions
All authors contributed jointly to the research. All authors read and approved the final manuscript.
About this article
Cite this article
Abdo, A., Pupin, M. LINGO-DL: a text-based approach for molecular similarity searching. J Comput Aided Mol Des 35, 657–665 (2021). https://doi.org/10.1007/s10822-021-00383-9
DOI: https://doi.org/10.1007/s10822-021-00383-9