Molecular Diversity, Volume 10, Issue 3, pp 283–299

Cheminformatics analysis and learning in a data pipelining environment

  • Moises Hassan
  • Robert D. Brown
  • Shikha Varma-O’Brien
  • David Rogers
Full-length paper


Workflow technology is increasingly being applied in discovery information to organize and analyze data. SciTegic's Pipeline Pilot is a chemically intelligent implementation of a workflow technology known as data pipelining. It allows scientists to construct and execute workflows using components that encapsulate many cheminformatics-based algorithms. In this paper we review SciTegic's methodology for molecular fingerprints, molecular similarity, molecular clustering, maximal common subgraph search, and Bayesian learning. Case studies show how these methods apply to the analysis of discovery data such as chemical series and high-throughput screening results. The paper demonstrates that the methods are well suited to a wide variety of tasks, such as building and applying predictive models of screening data, identifying molecules for lead optimization, and organizing molecules into families with structural commonality.

Key words

Bayesian models; bioactivity prediction; data mining; data pipelining; maximal common substructure search; molecular fingerprints; molecular similarity; virtual screening



Abbreviations

  • MCSS: maximal common substructure search
  • ECFP: extended connectivity fingerprints
  • FCFP: functional class fingerprints
  • MDDR: MDL Drug Data Report
  • WDI: World Drug Index
  • CATS: chemically advanced template search
  • BKD: binary kernel discrimination
  • CDK2: cyclin-dependent kinase 2
  • DHFR: Escherichia coli dihydrofolate reductase





Copyright information

© Springer Science + Business Media, Inc. 2006

Authors and Affiliations

  • Moises Hassan (1), corresponding author
  • Robert D. Brown (1)
  • Shikha Varma-O’Brien (2)
  • David Rogers (1)

  1. SciTegic, Inc., San Diego, USA
  2. Accelrys, Inc., San Diego, USA
