Comparison of descriptor spaces for chemical compound retrieval and classification

  • Regular Paper
Knowledge and Information Systems

Abstract

In recent years, the development of computational techniques that build models to correctly assign chemical compounds to various classes or to retrieve potential drug-like compounds has been an active area of research. Many of the best-performing techniques for these tasks use a descriptor-based representation of the compound that captures various aspects of the underlying molecular graph’s topology. In this paper, we compare five different sets of descriptors that are currently used for chemical compound classification. We also introduce four different descriptors derived from all connected fragments present in the molecular graphs, primarily to compare them with the currently used descriptor spaces and to analyze which properties of a descriptor space help it represent molecular graphs effectively. In addition, we introduce an extension to existing vector-based kernel functions that takes into account the length of the fragments present in the descriptors. We experimentally evaluate the previously introduced and the new descriptors in the context of SVM-based classification and ranked retrieval on 28 classification and retrieval problems derived from 18 datasets. Our experiments show that, for both tasks, two of the four descriptors introduced in this paper, along with the extended connectivity fingerprint-based descriptors, consistently and statistically outperform previously developed schemes based on the widely used fingerprint- and MACCS keys-based descriptors, as well as recently introduced descriptors obtained by mining and analyzing the structure of the molecular graphs.
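To make the idea of a fragment-length-aware kernel concrete, the following is a minimal sketch and not the authors' implementation. It assumes each compound is described by a dictionary of fragment counts, that each fragment key is a hypothetical (fragment string, length in bonds) tuple, and that the base similarity is the min-max (Tanimoto-style) kernel. The length-aware variant shown here simply averages the base kernel over groups of equal-length fragments, which is one plausible reading of the extension the abstract describes. All names (minmax_kernel, length_aware_kernel, mol_a, mol_b) are illustrative.

```python
from typing import Dict, Tuple

# Hypothetical descriptor: (fragment string, length in bonds) -> count.
Descriptor = Dict[Tuple[str, int], int]


def minmax_kernel(a: Descriptor, b: Descriptor) -> float:
    """Min-max (Tanimoto-style) similarity between two count descriptors."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


def length_aware_kernel(a: Descriptor, b: Descriptor) -> float:
    """Average the base kernel over groups of fragments of equal length.

    Assumption: fragments of each length contribute equally, so the many
    long fragments cannot swamp the few short ones.
    """
    lengths = {k[1] for k in a} | {k[1] for k in b}
    if not lengths:
        return 0.0
    total = 0.0
    for length in lengths:
        a_l = {k: v for k, v in a.items() if k[1] == length}
        b_l = {k: v for k, v in b.items() if k[1] == length}
        total += minmax_kernel(a_l, b_l)
    return total / len(lengths)


# Two toy compounds sharing their one-bond fragments but not the two-bond one.
mol_a: Descriptor = {("C-C", 1): 2, ("C-O", 1): 1, ("C-C-O", 2): 1}
mol_b: Descriptor = {("C-C", 1): 1, ("C-O", 1): 1, ("C-C-N", 2): 1}

print(minmax_kernel(mol_a, mol_b))        # pooled over all fragments
print(length_aware_kernel(mol_a, mol_b))  # averaged per fragment length
```

A matrix of such pairwise values could serve as a precomputed kernel for an SVM or as a similarity score for ranked retrieval. How well the per-length average preserves the kernel properties an SVM needs depends on the base kernel chosen.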

Author information

Corresponding author

Correspondence to Nikil Wale.

About this article

Cite this article

Wale, N., Watson, I.A. & Karypis, G. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl Inf Syst 14, 347–375 (2008). https://doi.org/10.1007/s10115-007-0103-5

  • DOI: https://doi.org/10.1007/s10115-007-0103-5
