Managing, profiling and analyzing a library of 2.6 million compounds gathered from 32 chemical providers

Monge, Aurélien; Arrault, Alban; Marot, Christophe; Morin-Allory, Luc

doi:10.1007/s11030-006-9033-5

Full-length paper
Published: 21 September 2006

Molecular Diversity, Volume 10, pages 389–403 (2006)

Summary

Data for 3.8 million compounds from the structural databases of 32 providers were gathered and stored in a single chemical database. Duplicates were removed using the IUPAC International Chemical Identifier (InChI), leaving 2.6 million compounds. Each provider database and the merged database were studied in terms of uniqueness, diversity, frameworks, and ‘drug-like’ and ‘lead-like’ properties. The merged database contains more than 87 000 frameworks and 2.1 million ‘drug-like’ molecules, of which more than one million are ‘lead-like’. The study was carried out with ‘ScreeningAssistant’, a program dedicated to managing chemical databases and generating screening sets. Compounds are stored in a MySQL database, and all operations on this database are carried out by Java code. Druglikeness and leadlikeness are estimated with ‘in-house’ scores based on functions that estimate how suitable each property value is; uniqueness is assessed with the InChI code, and diversity with molecular frameworks and fingerprints. The software was designed to make updating the database easy. ‘ScreeningAssistant’ is freely available under the GPL.
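
The duplicate-removal and uniqueness steps described in the summary can be illustrated with a minimal Python sketch. It is not the ScreeningAssistant implementation (which the summary says is written in Java on top of MySQL). It assumes the InChI strings have already been computed with one consistent InChI version and option set, and the provider names, record layout and function names are hypothetical.

```python
from collections import Counter, defaultdict


def deduplicate_by_inchi(records):
    """Keep the first record seen for each InChI string.

    `records` is an iterable of (provider, inchi) pairs. This layout is an
    illustrative assumption, not the schema used by ScreeningAssistant.
    Returns (unique_records, providers_by_inchi).
    """
    unique = []
    providers_by_inchi = defaultdict(set)
    for provider, inchi in records:
        # The membership test does not create a key in the defaultdict.
        if inchi not in providers_by_inchi:
            unique.append((provider, inchi))
        providers_by_inchi[inchi].add(provider)
    return unique, dict(providers_by_inchi)


def exclusive_counts(providers_by_inchi):
    """Count, per provider, the structures offered by no other provider."""
    counts = Counter()
    for providers in providers_by_inchi.values():
        if len(providers) == 1:
            counts[next(iter(providers))] += 1
    return counts


# Toy input: three distinct structures listed by two hypothetical providers.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
METHANE = "InChI=1S/CH4/h1H4"

records = [
    ("provider_a", ETHANOL),
    ("provider_b", ETHANOL),  # same structure, second supplier
    ("provider_a", BENZENE),
    ("provider_b", METHANE),
]

unique, providers_by_inchi = deduplicate_by_inchi(records)
exclusive = exclusive_counts(providers_by_inchi)
print(len(records), "records,", len(unique), "unique structures")
print(dict(exclusive))
```

Keeping the set of providers per identifier, rather than only a flag, lets the same pass answer both questions raised in the summary: how many compounds remain after merging, and how many structures each provider contributes exclusively.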

Abbreviations

HBA: H-bond acceptor
HBD: H-bond donor
HTS: high-throughput screening
InChI: IUPAC International Chemical Identifier
JNI: Java Native Interface
MW: molecular weight
RO5: rule-of-five
SCA: stochastic cluster analysis
SSSR: smallest set of smallest rings

Author information

Authors and Affiliations

Institut de Chimie Organique et Analytique, UMR CNRS 6005, Université d’Orléans, BP 6759, 45067, Orléans Cedex 2, France
Aurélien Monge, Alban Arrault, Christophe Marot & Luc Morin-Allory

Corresponding author

Correspondence to Aurélien Monge.

About this article

Cite this article

Monge, A., Arrault, A., Marot, C. et al. Managing, profiling and analyzing a library of 2.6 million compounds gathered from 32 chemical providers. Mol Divers 10, 389–403 (2006). https://doi.org/10.1007/s11030-006-9033-5

Received: 24 November 2005
Accepted: 07 April 2006
Published: 21 September 2006
Issue Date: August 2006
DOI: https://doi.org/10.1007/s11030-006-9033-5
