Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules

Schroeter, Timon Sebastian; Schwaighofer, Anton; Mika, Sebastian; Ter Laak, Antonius; Suelzle, Detlev; Ganzer, Ursula; Heinrich, Nikolaus; Müller, Klaus-Robert

doi:10.1007/s10822-007-9125-z

Published: 14 July 2007

Volume 21, pages 485–498, (2007)

Journal of Computer-Aided Molecular Design

Abstract

We investigate the use of different machine learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate way to obtain error bars, in order to estimate the domain of applicability (DOA) of each model. Here, we investigate error bars from a Bayesian model (Gaussian Process, GP), an ensemble-based approach (Random Forest), and approaches based on the Mahalanobis distance to the training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation and on an external validation set of 536 molecules) and of how faithfully the individual error bars represent the actual prediction error.
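To make the distance-based idea concrete, the following is a minimal sketch of a Mahalanobis distance between a query compound and a training set. It assumes the simplest variant, the distance to the training-set centroid under the sample covariance of the descriptors; the abstract does not state which variant the authors used, and the descriptor values and the helper names `covariance_matrix`, `solve` and `mahalanobis` are hypothetical, chosen for illustration only.

```python
def covariance_matrix(rows):
    """Return (means, sample covariance) for descriptor rows, one row per compound."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return means, cov


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x, means, cov):
    """Distance of x from the centroid, scaled by the training covariance."""
    diff = [xi - mi for xi, mi in zip(x, means)]
    z = solve(cov, diff)  # z = cov^{-1} diff, without forming the inverse
    return sum(d * zi for d, zi in zip(diff, z)) ** 0.5


# Hypothetical two-descriptor training set (illustrative values only).
train = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
means, cov = covariance_matrix(train)

near = mahalanobis([1.0, 1.0], means, cov)  # query at the centroid
far = mahalanobis([3.0, 1.0], means, cov)   # query outside the training range
print(near, far)
```

A larger distance suggests that the query compound lies further from the training data, so a prediction for it would be trusted less. How such a distance is mapped to an error bar is a separate modelling choice that this sketch does not cover.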

Notes

For some compounds, experimental values for both solubility and log D were available. For these compounds, we used log D predictions generated using a cross-validation procedure, so that each prediction was made by a log D model that had not been trained on the experimental log D value of the respective compound. This is necessary to avoid over-optimistic predictions.
It has been suggested to use numeric criteria, such as the log probability of the predictive distribution, for this purpose. In our experience these criteria can be misleading, so we did not use them. In particular, log probability tends to favor over-optimistic models.
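The cross-validation procedure described in the first note can be sketched as out-of-fold prediction: each compound's value is predicted by a model fitted without that compound. The sketch below is an illustration under stated assumptions, not the authors' implementation. A one-descriptor least-squares line stands in for the log D model, the fold assignment is a simple round-robin, and all names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit y = a + b * x (a stand-in for a real log D model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def out_of_fold_predictions(xs, ys, k=3):
    """Predict each compound with a model trained on the other folds only."""
    n = len(xs)
    preds = [None] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            preds[i] = a + b * xs[i]
    return preds


# Hypothetical descriptor values and measured targets (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
preds = out_of_fold_predictions(xs, ys)

# Changing one compound's measured value leaves its own prediction unchanged,
# because that value is never seen by the model that predicts it.
ys_changed = ys[:4] + [100.0] + ys[5:]
preds_changed = out_of_fold_predictions(xs, ys_changed)
print(preds[4], preds_changed[4])
```

The last lines show the property the note relies on: a compound's own measurement cannot leak into the prediction that is later used as an input for that compound.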

Acknowledgements

The authors gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778) and DFG grant MU 987/4-1. We thank Vincent Schütz and Carsten Jahn for maintaining the PCADMET database, and Gilles Blanchard for implementing the random forest method as part of our machine learning toolbox.

Author information

Authors and Affiliations

Fraunhofer FIRST, Kekuléstraße 7, 12489, Berlin, Germany
Timon Sebastian Schroeter, Anton Schwaighofer & Klaus-Robert Müller
Department of Computer Science, Technical University of Berlin, Franklinstraße 28/29, 10587, Berlin, Germany
Timon Sebastian Schroeter & Klaus-Robert Müller
idalab GmbH, Sophienstraße 24, 10178, Berlin, Germany
Sebastian Mika
Research Laboratories of Bayer Schering Pharma AG, Müllerstraße 178, 13342, Berlin, Germany
Antonius Ter Laak, Detlev Suelzle, Ursula Ganzer & Nikolaus Heinrich

Corresponding author

Correspondence to Timon Sebastian Schroeter.

About this article

Cite this article

Schroeter, T.S., Schwaighofer, A., Mika, S. et al. Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules. J Comput Aided Mol Des 21, 485–498 (2007). https://doi.org/10.1007/s10822-007-9125-z

Received: 12 March 2007
Accepted: 11 June 2007
Published: 14 July 2007
Issue Date: September 2007
DOI: https://doi.org/10.1007/s10822-007-9125-z
