Statistical variation in progressive scrambling

Abstract

The two methods most often used to evaluate the robustness and predictivity of partial least squares (PLS) models are cross-validation and response randomization. Both methods may be overly optimistic for data sets that contain redundant observations, however. The kinds of perturbation analysis widely used for evaluating model stability in the context of ordinary least squares regression are only applicable when the descriptors are independent of each other and errors are independent and normally distributed; neither assumption holds for QSAR in general and for PLS in particular. Progressive scrambling is a novel, non-parametric approach to perturbing models in the response space in a way that does not disturb the underlying covariance structure of the data. Here, we introduce adjustments for two of the characteristic values produced by a progressive scrambling analysis -- the deprecated predictivity (\(Q_{\rm s}^{\ast 2}\)) and standard error of prediction (\({\rm SDEP}_{\rm s}^{\ast}\)) -- that correct for the effect of introduced perturbation. We also explore the statistical behavior of the adjusted values (\(Q_{0}^{\ast 2}\) and \({\rm SDEP}_{0}^{\ast}\)) and the sensitivity to perturbation (\(dq^{2}/dr_{yy'}^{2}\)). It is shown that the three statistics are all robust for stable PLS models, in terms of the stochastic component of their determination and of their variation due to sampling effects involved in training set selection.
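As a rough illustration of the two conventional checks named above, the sketch below computes a leave-one-out cross-validated \(q^{2} = 1 - {\rm PRESS}/{\rm SS}\) and then repeats the calculation after fully randomizing the responses. It is not the progressive scrambling procedure itself: a one-descriptor ordinary least squares fit stands in for PLS, the data are synthetic, and the function names (`fit_line`, `loo_q2`, `squared_correlation`) are hypothetical. The squared correlation between original and perturbed responses is included only to show the kind of quantity that \(r_{yy'}^{2}\) denotes.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (an assumed stand-in for a PLS model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loo_q2(xs, ys):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS."""
    n = len(xs)
    mean_y = sum(ys) / n
    ss = sum((y - mean_y) ** 2 for y in ys)
    press = 0.0
    for i in range(n):
        # Refit without observation i, then predict it.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    return 1.0 - press / ss


def squared_correlation(u, v):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv * suv / (suu * svv)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data: one descriptor, a linear response plus Gaussian noise.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

q2_original = loo_q2(xs, ys)

# Response randomization: permute the responses, leave the descriptors alone.
ys_scrambled = ys[:]
rng.shuffle(ys_scrambled)
q2_scrambled = loo_q2(xs, ys_scrambled)
r2_yy = squared_correlation(ys, ys_scrambled)

print("q2 (original responses): ", round(q2_original, 3))
print("q2 (scrambled responses):", round(q2_scrambled, 3))
print("r2 between original and scrambled responses:", round(r2_yy, 3))
```

Full randomization is the extreme case of perturbing the responses; a partial perturbation would leave \(r_{yy'}^{2}\) somewhere between that of a complete shuffle and 1, which is the range over which a sensitivity such as \(dq^{2}/dr_{yy'}^{2}\) is defined.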

Keywords

cross-validation; PLS; progressive scrambling; redundancy; response randomization

Copyright information

© Springer 2004

Authors and Affiliations

  1. Tripos, Inc., St. Louis, USA
