Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

  • Alexander Golbraikh
  • Alexander Tropsha


One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power, defined as the ability of a model to accurately predict the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into a training set and a test set, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that the training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For a quantitative description of these criteria, we use the molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414–425). For rational division of a dataset into training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
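The idea of a sphere-exclusion split can be illustrated with a minimal sketch. The abstract does not specify the three variants used in the paper, so the code below shows only one generic form, under stated assumptions: compounds are plain descriptor vectors, distance is Euclidean, the sphere radius is a user-chosen parameter, and the next sphere center is simply the first remaining compound. The function names `sphere_exclusion_split` and `max_nearest_distance` are hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sphere_exclusion_split(points, radius):
    """Generic sphere-exclusion split (illustrative; not the paper's exact variants).

    Each selected center goes to the training set. Every other remaining
    compound inside the sphere of the given radius goes to the test set.
    Returns two lists of indices into `points`.
    """
    remaining = list(range(len(points)))
    train_idx, test_idx = [], []
    while remaining:
        # Assumption: the next center is the first compound still unassigned.
        center = remaining.pop(0)
        train_idx.append(center)
        inside = [i for i in remaining
                  if euclidean(points[center], points[i]) <= radius]
        test_idx.extend(inside)
        remaining = [i for i in remaining if i not in inside]
    return train_idx, test_idx


def max_nearest_distance(points, from_idx, to_idx):
    """Largest distance from any point in from_idx to its nearest point in to_idx."""
    if not from_idx or not to_idx:
        return 0.0
    return max(min(euclidean(points[i], points[j]) for j in to_idx)
               for i in from_idx)


# Toy two-dimensional "descriptor space" with two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0)]
train_idx, test_idx = sphere_exclusion_split(points, radius=0.5)
print(train_idx, test_idx)  # [0, 2, 4] [1, 3]

# Criterion (i): every test point lies within `radius` of some training point.
print(max_nearest_distance(points, test_idx, train_idx) <= 0.5)  # True
```

By construction, every test compound lies within the chosen radius of a training compound, which addresses criterion (i), and each new center lies farther than the radius from all earlier centers, which keeps the training set spread out, as criterion (iii) requires. Criterion (ii) is not guaranteed by this simple form: the isolated point above enters the training set with no test compound nearby, and the helper `max_nearest_distance(points, train_idx, test_idx)` can be used to check for such cases.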







Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Alexander Golbraikh (1)
  • Alexander Tropsha (1)
  1. The Laboratory for Molecular Modeling, School of Pharmacy, University of North Carolina, Chapel Hill
