Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

Journal of Computer-Aided Molecular Design

Abstract

One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power, defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414–425). For rational division of a dataset into training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
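
To make the idea of a sphere-exclusion split concrete, the following is a minimal sketch of one generic variant. It is an illustration only: the abstract does not specify the three algorithms used in the paper, so the rule for choosing the next center (the unassigned point farthest from all centers chosen so far), the function names, and the toy data are all assumptions. The helper `mean_nearest_distance` is likewise a simple stand-in for checking criteria (i) and (ii), not the diversity indices cited above.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sphere_exclusion_split(points, radius, start=0):
    """Split point indices into (training, test) by a simple sphere exclusion.

    Each selected center joins the training set; every still-unassigned
    point inside the sphere of the given radius around it joins the test set.
    The next center is the unassigned point farthest from all centers so far
    (an illustrative choice, not taken from the paper).
    """
    unassigned = set(range(len(points)))
    training, test = [], []
    center = start
    while unassigned:
        unassigned.discard(center)
        training.append(center)
        inside = sorted(i for i in unassigned
                        if euclidean(points[i], points[center]) <= radius)
        test.extend(inside)
        unassigned.difference_update(inside)
        if unassigned:
            center = max(
                sorted(unassigned),
                key=lambda i: min(euclidean(points[i], points[t])
                                  for t in training),
            )
    return training, test


def mean_nearest_distance(points, from_idx, to_idx):
    """Mean distance from each point in from_idx to its nearest point in to_idx."""
    return sum(
        min(euclidean(points[i], points[j]) for j in to_idx) for i in from_idx
    ) / len(from_idx)


if __name__ == "__main__":
    # Toy 2-D "descriptor space": two tight clusters and one isolated point.
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
    train, test = sphere_exclusion_split(data, radius=0.5)
    print("training:", train)
    print("test:", test)
    print("test -> training:", mean_nearest_distance(data, test, train))
    print("training -> test:", mean_nearest_distance(data, train, test))
```

In this toy example every test point lies close to a training point, but the isolated compound ends up in the training set with no test compound nearby, so the training-to-test distance is much larger than the test-to-training distance. This is the kind of asymmetry that criteria (i) and (ii) are meant to expose; the radius controls the trade-off between training-set size and diversity.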

References

  1. Hansch, C., Fujita, T., J. Am. Chem. Soc., 86 (1964) 1616–1626.

  2. Kubinyi, H., In: Mannhold, R. et al. (eds.) Methods and Principles in Medicinal Chemistry, VCH, Weinheim, 1993.

  3. Randić, M., J. Am. Chem. Soc., 97 (1975) 6609–6615.

  4. Kier, L.B. and Hall, L.H., Molecular Connectivity in Chemistry and Drug Research. Academic Press, New York, 1976.

  5. Kier, L.B. and Hall, L.H., Molecular Connectivity in Structure-Activity Analysis. Wiley, New York, 1986.

  6. Kier, L.B., Quant. Struct.-Act. Relat. 4 (1985) 109–116.

  7. Kier, L.B., Quant. Struct.-Act. Relat. 6 (1987) 8–12.

  8. Hall, L.H. and Kier, L.B., Quant. Struct.-Act. Relat. 9 (1990) 115–131.

  9. Hall, L.H., Mohney, B.K. and Kier, L.B., Quant. Struct.-Act. Relat., 10 (1991) 43–51.

  10. Hall, L.H., Mohney, B.K. and Kier, L.B., J. Chem. Inf. Comput. Sci., 31 (1991) 76–82.

  11. Kier, L.B. and Hall, L.H., Molecular Structure Description: The Electrotopological State, Academic Press, 1999.

  12. Kellogg, G.E., Kier, L.B., Gaillard, P. and Hall, L.H., J. Comput.-Aided Mol. Des. 10 (1996) 513–520.

  13. Sheridan, R.P., Nachbar, R.B. and Bush, B.L., J. Comput.-Aided Mol. Des. 8 (1994) 323–340.

  14. Matter, H., J. Med. Chem. 40(8) (1997) 1219–1229.

  15. Clementi, S. and Wold, S., In: Waterbeemd, H. van de (ed.), Chemometrics Methods in Molecular Design, VCH, (1995) 319–338.

  16. Wold, S., In: Waterbeemd, H. van de (ed.), Chemometrics Methods in Molecular Design, VCH, (1995) 195–218.

  17. Hoffman, B., Cho, S.J., Zheng, W., Wyrick, S., Nichols, D.E. and Mailman, R.B., J. Med. Chem. 42 (1999) 3217–3226.

  18. Zheng, W. and Tropsha, A., J. Chem. Inf. Comput. Sci., 40 (2000) 185–194.

  19. Ajay, J. Med. Chem. 36 (1993) 3565–3571.

  20. Cramer III, R.D., Patterson, D.E. and Bunce, J.D., J. Am. Chem. Soc. 110 (1988) 5959–5967.

  21. Marshall, G.R. and Cramer III, R.D., Trends Pharmacol. Sci. 9 (1988) 285–289.

  22. Pérez, C., Pastor, M., Ortiz, A.R. and Gago, F., J. Med. Chem. 41 (1998) 836–852.

  23. Cho, S.J. and Tropsha, A., J. Med. Chem. 38 (1995) 1060–1066.

  24. Klebe, G., In: Kubinyi, H., Folkers, G., Martin, Y.C., (eds.) 3D QSAR in Drug Design. Volume 3. Recent Advances, Kluwer/ESCOM: Dordrecht, (1998) pp. 87–104.

  25. Kubinyi, H., Hamprecht, F.A. and Mietzner, T., J. Med. Chem., 41 (1998) 2553–2564.

  26. Topliss, J.G. and Edwards, R.P., J. Med. Chem. 22 (1979) 1238–1244.

  27. Gironés, X., Gallegos, A. and Carbó-Dorca, R., J. Chem. Inf. Comput. Sci. 40 (2000) 1400–1407.

  28. Bordás, B., Kömíves, T., Szántó, Z. and Lopata, A., J. Agric. Food Chem. 48 (2000) 926–931.

  29. Fan, Y., Shi, L.M., Kohn, K.W., Pommier, Y. and Weinstein, J.N., J. Med. Chem. 44 (2001) 3254–3263.

  30. Randić, M. and Basak, S.C., J. Chem. Inf. Comput. Sci. 40 (2000) 899–905.

  31. Suzuki, T., Ide, K., Ishida, M. and Shapiro, S., J. Chem. Inf. Comput. Sci. 41 (2001) 718–726.

  32. Recanatini, M., Cavalli, A., Belluti, F., Piazzi, L., Rampa, A., Bisi, A., Gobbi, S., Valenti, P., Andrisano, V., Bartolini, M. and Cavrini, V., J. Med. Chem. 43 (2000) 2007–2018.

  33. Morón, J.A., Campillo, M., Perez, V., Unzeta, M. and Pardo, L., J. Med. Chem. 43 (2000) 1684–1691.

  34. Golbraikh, A. and Tropsha, A., J. Mol. Graphics Model. 20 (2002) 269–276.

  35. Wold, S. and Eriksson, L., Statistical Validation of QSAR Results. In: Waterbeemd, H. van de (ed.), Chemometrics Methods in Molecular Design, VCH, (1995) 309–318.

  36. Clark, R.D., Sprous, D.G. and Leonard, J.M., Validating Models Based on Large Dataset. In: Höltje, H.-D., Sippl, W., (eds.) Rational Approaches to Drug Design. Proceedings of the 13th European Symposium on Quantitative Structure-Activity Relationships. Aug 27 – Sept 1 (2000), Duesseldorf, Germany. Prous Science, (2001) 475–485.

  37. Novellino, E., Fattorusso, C. and Greco, G., Pharm. Acta Helv. 70 (1995) 149–154.

  38. Norinder, U., J. Chemomet. 10 (1996) 95–105.

  39. Zefirov, N.S. and Palyulin, V.A., J. Chem. Inf. Comput. Sci. 41 (2001) 1022–1027.

  40. Sachs, L., Applied Statistics. A Handbook of Techniques. Springer-Verlag, (1984).

  41. Huuskonen, J., J. Chem. Inf. Comput. Sci. 41 (2001) 425–429.

  42. Tetko, I.V., Kovalishyn, V.V. and Livingstone, D.J., J. Med. Chem. 44 (2001) 2411–2420.

  43. Wu, W., Walczak, B., Massart, D.L., Heuerding, S., Erni, F., Last, I.R. and Prebble, K.A., Chemometr. Intell. Lab. Syst. 33 (1996) 35–46.

  44. Yasri, A. and Hartsough, D., J. Chem. Inf. Comput. Sci. 41 (2001) 1218–1227.

  45. Bernard, P., Kireev, D.B., Chretien, J.R., Fortier, P.L. and Coppet, L., J. Comput.-Aided Mol. Des. 13 (1999) 355–371.

  46. Takeuchi, Y., Shands, E.F.B., Beusen, D.D. and Marshall, G.R., J. Med. Chem. 41 (1998) 3609–3623.

  47. Kauffman, G.V. and Jurs, P.C., J. Chem. Inf. Comput. Sci. 41 (2001) 1553–1560.

  48. Mattioni, B.E. and Jurs, P.C., J. Chem. Inf. Comput. Sci., in press.

  49. Gasteiger, J. and Zupan, J., Angew. Chem. Int. Ed. Engl. 32(4) (1993) 503.

  50. Loukas, Y.L., J. Med. Chem. 44 (2001) 2772–2783.

  51. Bernard, P., Pintore, M., Berthon, J.Y. and Chretien, J.R., Eur. J. Med. Chem. 36 (2001) 1–19.

  52. Burden, F.R. and Winkler, D.A., J. Med. Chem. 42 (1999) 3183–3187.

  53. Burden, F.R., Ford, M.G., Whitley, D.C. and Winkler, D.A., J. Chem. Inf. Comput. Sci. 40 (2000) 1423–1430.

  54. Adams, M.J., Chemometrics in Analytical Spectroscopy. The Royal Society of Chemistry, UK, 1995.

  55. Potter, T. and Matter, H., J. Med. Chem. 41 (1998) 478–488.

  56. Lajiness, M., Johnson, M.A. and Maggiora, G.M., In: Fauchere, J.L., (ed.), QSAR: Quantitative Structure-Activity Relationships in Drug Design, Alan R. Liss Inc.: New York, (1989) pp. 173–176.

  57. Taylor, R., J. Chem. Inf. Comput. Sci. 35 (1995) 59–67.

  58. Snarey, M., Terrett, N.K., Willett, P. and Wilton, D.J., J. Mol. Graphics Model. 15 (1997) 372–385.

  59. Kennard, R.W. and Stone, L.A., Technometrics 11 (1969) 137–148.

  60. Bourguignon, B., Deaguiar, P.F., Thorre, K. and Massart, D.L., J. Chromatogr. Sci. 32 (1994) 144–152.

  61. Bourguignon, B., Deaguiar, P.F., Khots, M.S. and Massart, D.L., Anal. Chem. 66 (1994) 893–904.

  62. Hellberg, S., Eriksson, L., Jonsson, J., Lindgren, F., Sjostrom, M., Skagerberg, B., Wold, S. and Andrews, P., Int. J. Pept. Protein. Res. 37 (1991) 414–424.

  63. Eriksson, L. and Johansson, E., Chemometr. Intell. Lab. Syst. 34 (1996) 1–19.

  64. Carlson, R., Design and Optimization in Organic Synthesis. Elsevier, (1992).

  65. Martin, E.J. and Critchlow, R.E., J. Comb. Chem. 1 (1999) 32–45.

  66. Miller, A. and Nguyen, N.-K., Appl. Stat. 43 (1994) 669–678.

  67. Mitchell, T.J., Technometrics 16 (1974) 203–210.

  68. Mitchell, T.J., Technometrics 42 (2000) 48–54.

  69. Reynolds, C.H., Druker, R. and Pfahler, L.B., J. Chem. Inf. Comput. Sci. 38 (1998) 305–312.

  70. Molconn-Z. http://www.eslc.vabiotech.com/

  71. Bucholz, E., Brown, R.L., Tropsha, A., Booth, R.G. and Wyrick, S.D., J. Med. Chem. 42 (1999) 3041–3054.

  72. Golbraikh, A., Bonchev, D., Xiao, Y.-D. and Tropsha, A., In: Rational Approaches to Drug Design. Proceedings of the 13th European Symposium on Quantitative Structure-Activity Relationships, Prous Science, (2001) pp. 219–223.

  73. Golbraikh, A., Bonchev, D. and Tropsha, A., J. Chem. Inf. Comput. Sci. 41 (2001) 147–158.

  74. Kier, L.B. and Hall, L.H., Quant. Struct.-Act. Relat. 10 (1991) 134–140.

  75. Petitjean, M., J. Chem. Inf. Comput. Sci. 32 (1992) 331–337.

  76. Wiener, H., J. Am. Chem. Soc. 69 (1947) 17.

  77. Platt, J.R., J. Phys. Chem. 56 (1952) 328.

  78. Shannon, C. and Weaver, W., Mathematical Theory of Communication, University of Illinois, Urbana, (1949).

  79. Bonchev, D., Mekenyan, O. and Trinajstic, N., J. Comput. Chem., 2 (1981) 127–148.

  80. Gutman, I., Ruscić, B., Trinajstić, N. and Wilcox, C.F., Jr., J. Chem. Phys., 62 (1975) 3399.

  81. Rücker, G. and Rücker, C., J. Chem. Inf. Comput. Sci., 33 (1993) 683–695.

  82. Bonchev, D., In: Devillers J., Balaban, A.T. (eds.), Topological Indices and Related Descriptors, Gordon and Breach, Reading, U.K. (1999) pp. 361–401.

  83. Bonchev, D., SAR/QSAR Env. Res., 7 (1997) 23–43.

  84. Golbraikh, A., J. Chem. Inf. Comput. Sci. 40 (2000) 414–425.

Cite this article

Golbraikh, A., Tropsha, A. Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. J Comput Aided Mol Des 16, 357–369 (2002). https://doi.org/10.1023/A:1020869118689
