Journal of Computer-Aided Molecular Design, Volume 21, Issue 9, pp 485–498

Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules

  • Timon Sebastian Schroeter
  • Anton Schwaighofer
  • Sebastian Mika
  • Antonius Ter Laak
  • Detlev Suelzle
  • Ursula Ganzer
  • Nikolaus Heinrich
  • Klaus-Robert Müller


We investigate the use of different machine learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate way to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble-based approach (Random Forest), and approaches based on the Mahalanobis distance to the training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and in terms of how faithfully the individual error bars reflect the actual prediction error.
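The three error-bar strategies named above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, not the authors' actual setup: the descriptors, kernel choice, and hyperparameters here are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from scipy.spatial.distance import mahalanobis

# Toy data standing in for molecular descriptors / log-solubility values.
X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

# 1) Bayesian error bars: the GP predictive standard deviation.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_train, y_train)
gp_mean, gp_std = gp.predict(X_test, return_std=True)

# 2) Ensemble error bars: spread of the individual Random Forest trees.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
per_tree = np.stack([tree.predict(X_test) for tree in rf.estimators_])
rf_std = per_tree.std(axis=0)

# 3) Distance-based DOA: Mahalanobis distance of each query point
#    to the training-data distribution (mean and inverse covariance).
mu = X_train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_train, rowvar=False))
dist = np.array([mahalanobis(x, mu, cov_inv) for x in X_test])

# In each scheme, a larger value (gp_std, rf_std, dist) flags a compound
# as lying further outside the model's domain of applicability.
```

The common idea is that all three quantities grow as a query compound moves away from the training data, so each can be calibrated against observed errors to delimit the DOA.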


Keywords: Solubility · Aqueous · Machine learning · Drug discovery · Domain of applicability · Error bar · Error estimation · Gaussian Process · Bayesian modeling · Random forest · Ensemble · Decision tree · Support vector machine · Ridge regression · Distance



Acknowledgements

The authors gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778) and DFG grant MU 987/4-1. We thank Vincent Schütz and Carsten Jahn for maintaining the PCADMET database, and Gilles Blanchard for implementing the random forest method as part of our machine learning toolbox.



Copyright information

© Springer Science+Business Media B.V. 2007

Authors and Affiliations

  • Timon Sebastian Schroeter (1, 2)
  • Anton Schwaighofer (1)
  • Sebastian Mika (3)
  • Antonius Ter Laak (4)
  • Detlev Suelzle (4)
  • Ursula Ganzer (4)
  • Nikolaus Heinrich (4)
  • Klaus-Robert Müller (1, 2)

  1. Fraunhofer FIRST, Berlin, Germany
  2. Department of Computer Science, Technical University of Berlin, Berlin, Germany
  3. idalab GmbH, Berlin, Germany
  4. Research Laboratories of Bayer Schering Pharma AG, Berlin, Germany
