Comparison of Methods Based on Diversity and Similarity for Molecule Selection and the Analysis of Drug Discovery Data

  • Raymond L.H. Lam
  • William J. Welch
Part of the Methods in Molecular Biology™ book series (MIMB, volume 275)


The concepts of diversity and similarity of molecules are widely used in quantitative methods for designing (selecting) a representative set of molecules and for analyzing the relationship between chemical structure and biological activity. We review methods and algorithms for design of a diverse set of molecules in the chemical space using clustering, cell-based partitioning, or other distance-based approaches. Analogous cell-based and clustering methods are described for analyzing drug-discovery data to predict activity in virtual screening. Some performance comparisons are made. The choice of descriptor variables to characterize chemical structure is also included in the comparative study. We find that the diversity of a selected set is quite sensitive to both the statistical selection method and the choice of molecular descriptors and that, for the dataset used in this study, random selection works surprisingly well in providing a set of data for analysis.
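As a rough illustration of the distance-based design idea mentioned above, the following is a minimal sketch of greedy maximin (farthest-point) selection. It is one common distance-based heuristic and is not necessarily the algorithm used in the chapter. The function names `maximin_select` and `min_pairwise_distance`, the Euclidean metric, and the toy descriptor values are all assumptions made for this example.

```python
import math
import random


def maximin_select(X, k, start=0):
    """Greedily pick k rows of X (lists of descriptor values) that are far apart.

    Hypothetical helper: each step adds the molecule whose distance to its
    nearest already-selected molecule is largest (Euclidean distance assumed).
    """
    if not 1 <= k <= len(X):
        raise ValueError("k must be between 1 and the number of molecules")
    selected = [start]
    # Distance from every molecule to its nearest selected molecule so far.
    min_dist = [math.dist(x, X[start]) for x in X]
    while len(selected) < k:
        nxt = max(range(len(X)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [min(d, math.dist(x, X[nxt])) for d, x in zip(min_dist, X)]
    return selected


def min_pairwise_distance(X, idx):
    """Smallest distance between any two selected molecules (needs >= 2 indices)."""
    return min(
        math.dist(X[i], X[j])
        for a, i in enumerate(idx)
        for j in idx[a + 1:]
    )


if __name__ == "__main__":
    # Toy data: 200 molecules with 3 made-up descriptor values each.
    rng = random.Random(0)
    X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
    diverse = maximin_select(X, 10)
    at_random = rng.sample(range(len(X)), 10)
    print("maximin spread:", min_pairwise_distance(X, diverse))
    print("random spread: ", min_pairwise_distance(X, at_random))
```

The minimum pairwise distance is only one possible diversity measure. Because, as the abstract notes, the diversity of a selected set is sensitive to both the selection method and the descriptors, any such comparison against random selection should be repeated for each descriptor set of interest.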

Key Words

Biological activity; cell-based partitioning; chemical descriptors; classification; clustering; distance-based design; diversity selection; high-throughput screening; quantitative structure-activity relationship



Copyright information

© Humana Press Inc. 2004

Authors and Affiliations

  • Raymond L.H. Lam (1)
  • William J. Welch (2, 3)
  1. Department of Data Exploration Sciences, GlaxoSmithKline, King of Prussia, USA
  2. Department of Statistics, University of British Columbia, Vancouver
  3. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada
