Mapping Between Databases of Compounds and Protein Targets

  • Sorel Muresan
  • Markus Sitzmann
  • Christopher Southan
Part of the Methods in Molecular Biology book series (MIMB, volume 910)


Databases that provide links between bioactive compounds and their protein targets are increasingly important in drug discovery and chemical biology. They join the expanding universes of cheminformatics, via chemical structures, and bioinformatics, via sequences. However, it is difficult to assess the relative utility of databases without an explicit comparison of their content. We exemplify an approach to this by comparing four resources, each with a different focus on bioactive chemistry (ChEMBL, DrugBank, Human Metabolome Database, and Therapeutic Target Database), at both the chemical structure and protein levels. We compared the compound sets at different representational stringencies using NCI/CADD Structure Identifiers. The overlap and uniqueness in chemical content can be broadly interpreted in the context of the different data capture strategies. However, we recorded apparent anomalies, such as the many compounds in common between the metabolite and drug databases. We also compared the sequences mapped to the compounds, using their UniProt protein identifiers. While these results were also generally interpretable in the context of the individual databases, we discerned differences in coverage and in the types of supporting data used. For example, the target concept is applied differently in DrugBank and in the Therapeutic Target Database. In ChEMBL it encompasses a broader range of mappings, from chemical biology and species orthologue cross-screening, in addition to drug targets per se. Our analysis should assist users not only in exploiting the synergies between these four high-value resources but also in assessing the utility of other databases at the interface of chemistry and biology.
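The overlap and uniqueness comparison described above can be illustrated with plain set operations. The following is a minimal sketch, not the procedure used in the chapter: the function name `overlap_report` is hypothetical, and the identifiers are placeholders standing in for structure identifiers (or UniProt identifiers) that would first have to be computed or extracted for each database.

```python
from itertools import combinations


def overlap_report(sets_by_db):
    """Return (pairwise shared counts, per-database unique counts).

    sets_by_db maps a database name to a set of identifiers.
    """
    # Number of identifiers shared by each pair of databases.
    pairwise = {
        (a, b): len(sets_by_db[a] & sets_by_db[b])
        for a, b in combinations(sorted(sets_by_db), 2)
    }
    # Number of identifiers found in one database and in no other.
    unique = {}
    for name, ids in sets_by_db.items():
        others = set().union(
            *(s for other, s in sets_by_db.items() if other != name)
        )
        unique[name] = len(ids - others)
    return pairwise, unique


# Placeholder identifiers only; these are NOT real database content.
toy = {
    "ChEMBL": {"id1", "id2", "id3", "id4"},
    "DrugBank": {"id2", "id3", "id5"},
    "HMDB": {"id3", "id6"},
    "TTD": {"id2", "id7"},
}

pairwise, unique = overlap_report(toy)
for pair, shared in pairwise.items():
    print(pair, shared)
print(unique)
```

Running the same function on identifier sets generated at each representational stringency would show how the apparent overlap changes as the definition of "same compound" is loosened or tightened.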

Key words

Bioactive compounds, Small-molecule databases, Chemical structure identifiers, Cheminformatics, Bioinformatics, Drug targets, ChEMBL, DrugBank, Human Metabolome Database, Therapeutic Target Database



Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • Sorel Muresan (1), corresponding author
  • Markus Sitzmann (2)
  • Christopher Southan (1, 3)

  1. DECS Global Compound Sciences, Computational Chemistry, AstraZeneca R&D, Mölndal, Sweden
  2. Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, USA
  3. ChrisDS Consulting, Göteborg, Sweden