Diversifying chemical libraries with generative topographic mapping


Generative topographic mapping was used to investigate whether the in-house compound collection of Boehringer Ingelheim (BI) could be diversified. For this purpose, a 2D map covering the relevant chemical space was trained, and the BI compound library was compared with the Aldrich-Market Select (AMS) database of more than 8M purchasable compounds. To discover new (sub)structures, the “AutoZoom” tool was developed and applied to analyze the chemotypes of molecules residing in heavily populated zones of the map and to extract the corresponding maximum common substructures. A set of 401K new structures from the AMS database was retrieved and checked for drug-likeness and biological activity.
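The library comparison described above can be illustrated with a minimal sketch. It is not the authors' AutoZoom tool or their GTM implementation: it assumes each compound has already been projected to a single 2D map coordinate in [-1, 1] (GTM itself distributes a compound over map nodes through responsibilities), and the function names, grid size, and thresholds are hypothetical. The sketch counts compounds per grid cell for two libraries and reports cells that are populated in a reference library but empty in the in-house one, which is one simple way to flag zones worth a closer look.

```python
from collections import Counter


def node_counts(positions, grid_size, extent=(-1.0, 1.0)):
    """Count compounds per grid cell for 2D map coordinates within `extent`."""
    lo, hi = extent
    counts = Counter()
    for x, y in positions:
        # Clamp indices so that coordinates on the boundary stay on the grid.
        col = min(max(int((x - lo) / (hi - lo) * grid_size), 0), grid_size - 1)
        row = min(max(int((y - lo) / (hi - lo) * grid_size), 0), grid_size - 1)
        counts[(row, col)] += 1
    return counts


def candidate_cells(reference_counts, inhouse_counts, min_reference=1, max_inhouse=0):
    """Cells populated by the reference library but sparse in the in-house one."""
    return sorted(
        cell
        for cell, n in reference_counts.items()
        if n >= min_reference and inhouse_counts.get(cell, 0) <= max_inhouse
    )


# Toy coordinates standing in for projected compounds (illustrative only).
inhouse = [(-0.9, -0.9), (-0.8, -0.7), (0.1, 0.1)]
reference = [(-0.9, -0.8), (0.9, 0.9), (0.8, 0.95), (0.1, 0.2)]

inhouse_counts = node_counts(inhouse, grid_size=4)
reference_counts = node_counts(reference, grid_size=4)
cells = candidate_cells(reference_counts, inhouse_counts)
print(cells)
```

In a real workflow the compounds in each flagged cell would then be inspected for shared chemotypes, for example by extracting a maximum common substructure with a cheminformatics toolkit.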





  • Generative topographic mapping
  • Frame set
  • Applicability domain
  • Radial basis function
  • Aldrich Market Select
  • Boehringer Ingelheim
  • Maximum common substructure




The authors thank Boehringer Ingelheim Pharma GmbH & Co KG for providing the data.


The project leading to this article has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Grant Agreement No. 676434, “Big Data in Chemistry” (“BIGCHEM”, http://bigchem.eu).

Author information




The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Corresponding authors

Correspondence to Bernd Beck or Alexandre Varnek.



About this article


Cite this article

Lin, A., Beck, B., Horvath, D. et al. Diversifying chemical libraries with generative topographic mapping. J Comput Aided Mol Des 34, 805–815 (2020). https://doi.org/10.1007/s10822-019-00215-x



  • Generative topographic mapping
  • Chemical library diversity enrichment
  • Big data