Using Recursive Partitioning Analysis to Evaluate Compound Selection Methods

  • S. Stanley Young
  • Douglas M. Hawkins
Part of the Methods in Molecular Biology™ book series (MIMB, volume 275)


Designing and analyzing a screening set for high-throughput screening is complex. We examine three statistical strategies for compound selection: random, clustering, and space-filling. We also examine two types of chemical descriptors: BCUTs and principal components of Dragon Constitutional descriptors. Based on the predictive power of multiple-tree recursive partitioning, we reach the following tentative conclusions. Random designs appear to be as good as clustering and space-filling designs. For analysis, BCUTs appear to be better than principal component scores based on Constitutional descriptors. We also confirm previous results showing that model-based selection of compounds can lead to improved screening hit rates.
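As a rough illustration of how two of the selection strategies differ, the sketch below picks a screening subset at random and by a greedy maximin rule, which is one common space-filling heuristic. It is not the authors' procedure, since the abstract does not specify how the designs were built. The descriptor vectors are synthetic stand-ins for BCUT values, and the names `random_design`, `maximin_design`, and `min_pairwise_distance` are hypothetical.

```python
import math
import random


def random_design(points, k, seed=0):
    """Pick k compounds uniformly at random (returns indices into points)."""
    rng = random.Random(seed)
    return rng.sample(range(len(points)), k)


def maximin_design(points, k, start=0):
    """Greedy maximin selection: repeatedly add the compound that is
    farthest from its nearest already-selected compound."""
    selected = [start]
    # Distance from every compound to its nearest selected compound.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected


def min_pairwise_distance(points, idx):
    """Smallest distance between any two selected compounds."""
    return min(
        math.dist(points[a], points[b])
        for n, a in enumerate(idx)
        for b in idx[n + 1:]
    )


# Synthetic 3-dimensional descriptors for 500 hypothetical compounds
# (assumption: real work would use BCUTs or principal component scores).
rng = random.Random(42)
compounds = [tuple(rng.random() for _ in range(3)) for _ in range(500)]

rand_idx = random_design(compounds, 20)
fill_idx = maximin_design(compounds, 20)
print("random  min distance:", min_pairwise_distance(compounds, rand_idx))
print("maximin min distance:", min_pairwise_distance(compounds, fill_idx))
```

On data like this, the maximin subset typically has a larger minimum pairwise distance than the random one. Whether that extra spread yields better predictive models is the question the chapter examines.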

Key Words

Decision trees; high throughput screening; initial screening sets; random recursive partitioning; recursive partitioning; sequential screening



Copyright information

© Humana Press Inc. 2004

Authors and Affiliations

  • S. Stanley Young (1)
  • Douglas M. Hawkins (2)
  1. National Institute of Statistical Sciences, Research Triangle Park, North Carolina, USA
  2. School of Statistics, University of Minnesota, Minneapolis, USA
