Diversifying chemical libraries with generative topographic mapping

  • Arkadii Lin
  • Bernd Beck
  • Dragos Horvath
  • Gilles Marcou
  • Alexandre Varnek


Generative topographic mapping was used to investigate how the in-house compound collection of Boehringer Ingelheim (BI) could be diversified. For this purpose, a 2D map covering the relevant chemical space was trained, and the BI compound library was compared to the Aldrich-Market Select (AMS) database of more than 8M purchasable compounds. To discover new (sub)structures, the “AutoZoom” tool was developed and applied to analyze the chemotypes of molecules residing in heavily populated zones of the map and to extract the corresponding maximum common substructures. A set of 401K new structures from the AMS database was retrieved and checked for drug-likeness and biological activity.
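The library comparison described above can be illustrated with a minimal sketch. It assumes that each compound has already been assigned to one node of a trained 2D map, and it uses hard node assignments and toy data for simplicity. The function names and the thresholds `min_candidates` and `max_reference` are hypothetical and are not taken from the AutoZoom tool.

```python
from collections import Counter


def node_counts(node_ids):
    """Count how many compounds of a library fall on each map node."""
    return Counter(node_ids)


def enrichment_nodes(reference_nodes, candidate_nodes,
                     min_candidates=3, max_reference=0):
    """Return map nodes that are heavily populated by the candidate
    library but (nearly) empty in the reference library.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    ref = node_counts(reference_nodes)
    cand = node_counts(candidate_nodes)
    return sorted(
        node for node, n in cand.items()
        if n >= min_candidates and ref.get(node, 0) <= max_reference
    )


# Toy data: (row, col) node index on a 2D map for each compound.
bi_nodes = [(0, 0), (0, 0), (0, 1), (1, 1)]           # in-house library
ams_nodes = [(0, 0), (1, 1), (2, 2), (2, 2), (2, 2),  # purchasable library
             (0, 1), (3, 0), (3, 0)]

novel = enrichment_nodes(bi_nodes, ams_nodes)
print(novel)  # [(2, 2)]
```

Compounds residing on the returned nodes would be the candidates for closer inspection, for example by extracting their maximum common substructures with a cheminformatics toolkit.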


Keywords: Generative topographic mapping; Chemical library diversity enrichment; Big data



Abbreviations

  • GTM: Generative topographic mapping
  • FS: Frame set
  • AD: Applicability domain
  • RBF: Radial basis function
  • AMS: Aldrich Market Select
  • BI: Boehringer Ingelheim
  • MCS: Maximum common substructure



The authors thank Boehringer Ingelheim Pharma GmbH & Co. KG for providing the data.

Author contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.


The project leading to this article has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Grant Agreement No. 676434, “Big Data in Chemistry” (“BIGCHEM”,



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, Strasbourg, France
  2. Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany
