Comparison of Methods Based on Diversity and Similarity for Molecule Selection and the Analysis of Drug Discovery Data
The concepts of diversity and similarity of molecules are widely used in quantitative methods for designing (selecting) a representative set of molecules and for analyzing the relationship between chemical structure and biological activity. We review methods and algorithms for designing a diverse set of molecules in chemical space using clustering, cell-based partitioning, or other distance-based approaches. Analogous cell-based and clustering methods are described for analyzing drug-discovery data to predict activity in virtual screening. Some performance comparisons are made, and the comparative study also covers the choice of descriptor variables used to characterize chemical structure. We find that the diversity of a selected set is quite sensitive to both the statistical selection method and the choice of molecular descriptors and that, for the dataset used in this study, random selection works surprisingly well in providing a set of data for analysis.
Key Words: Biological activity; cell-based partitioning; chemical descriptors; classification; clustering; distance-based design; diversity selection; high-throughput screening; quantitative structure-activity relationship
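
To make the idea of distance-based diversity selection concrete, the following is a minimal sketch of a greedy maximin heuristic, compared with random selection. It is not taken from the chapter and may differ from the algorithms the authors evaluate. The function names (`maximin_select`, `min_pairwise_distance`), the Euclidean metric, the two-dimensional descriptor space, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def maximin_select(points, k, start=0):
    """Greedy maximin selection (illustrative heuristic, not the chapter's method).

    Starting from points[start], repeatedly add the point whose distance
    to its nearest already-chosen point is largest.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
    return chosen


def min_pairwise_distance(points, idx):
    """Smallest distance between any two selected points.

    A larger value means the selected set is more spread out.
    """
    return min(math.dist(points[i], points[j])
               for a, i in enumerate(idx) for j in idx[a + 1:])


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 2-D descriptor space with 200 synthetic "molecules".
    pts = [(rng.random(), rng.random()) for _ in range(200)]
    diverse = maximin_select(pts, 10)
    rand = rng.sample(range(len(pts)), 10)
    print("maximin:", min_pairwise_distance(pts, diverse))
    print("random: ", min_pairwise_distance(pts, rand))
```

The minimum pairwise distance is only one possible diversity criterion. As the abstract notes, how diverse a selected set appears depends on both the selection method and the descriptors, so a different metric or descriptor set could rank the two selections differently.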