Molecular Diversity, Volume 5, Issue 4, pp 231–243

Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

  • Alexander Golbraikh
  • Alexander Tropsha


One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power. The latter can be defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into the training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414–425). For rational division of a dataset into the training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
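The abstract does not spell out the three sphere-exclusion variants, so the following is only a minimal sketch of the general idea, under stated assumptions. Here a randomly chosen remaining compound becomes a training-set sphere centre, every other remaining compound within a fixed radius of it is assigned to the test set, and all of them are removed from the pool. The function names, the random choice of centres, the Euclidean metric, and the fixed `radius` parameter are all illustrative assumptions, not the authors' procedure.

```python
import math
import random


def sphere_exclusion_split(points, radius, seed=0):
    """Split row indices of `points` into (train, test) by sphere exclusion.

    Assumed rule (illustrative only): a random remaining point becomes a
    training-set sphere centre; every other remaining point within `radius`
    of that centre goes to the test set; centre and neighbours then leave
    the pool. Repeats until the pool is empty.
    """
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    train, test = [], []
    while remaining:
        centre = rng.choice(remaining)
        train.append(centre)
        inside = [
            i for i in remaining
            if i != centre and math.dist(points[i], points[centre]) <= radius
        ]
        test.extend(inside)
        excluded = set(inside) | {centre}
        remaining = [i for i in remaining if i not in excluded]
    return train, test


def max_nearest_distance(points, from_idx, to_idx):
    """Largest distance from any point in `from_idx` to its nearest point in `to_idx`."""
    return max(
        min(math.dist(points[i], points[j]) for j in to_idx)
        for i in from_idx
    )


if __name__ == "__main__":
    # Hypothetical 2-D "descriptor space": a 6 x 6 grid with spacing 0.1.
    grid = [(x * 0.1, y * 0.1) for x in range(6) for y in range(6)]
    train_idx, test_idx = sphere_exclusion_split(grid, radius=0.15, seed=1)
    print(len(train_idx), "training points,", len(test_idx), "test points")
    print("test -> nearest train:", max_nearest_distance(grid, test_idx, train_idx))
```

By construction, every test point in this sketch lies within `radius` of some training point (criterion i), and training centres are more than `radius` apart from one another, which gives a crude form of diversity (criterion iii). Criterion (ii) is not guaranteed: an isolated centre may have no test neighbours, which can be checked by calling `max_nearest_distance` with the training indices first and the test indices second.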


Keywords (added by machine, not by the authors): Polymer; Biological Activity; Predictive Power; Diversity Principle; Diversity Sampling





Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Alexander Golbraikh (1)
  • Alexander Tropsha (1)

  1. The Laboratory for Molecular Modeling, School of Pharmacy, University of North Carolina, Chapel Hill
