Abstract
We investigate the use of different machine learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate way to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process (GP)), an ensemble-based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and in terms of how faithfully the individual error bars represent the actual prediction error.
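As a rough illustration of the distance-based idea mentioned in the abstract, the sketch below computes the Mahalanobis distance of query points to the mean of a training set. It is a minimal sketch under stated assumptions, not the procedure used in the paper: measuring distance to the training mean (rather than, for example, to the nearest training compounds), the small `ridge` term added to the covariance, and the function name `mahalanobis_to_training` are all choices made here for illustration, and the data are synthetic.

```python
import numpy as np

def mahalanobis_to_training(X_train, X_query, ridge=1e-6):
    """Mahalanobis distance of each query row to the training-set mean.

    Assumption: distance is measured to the training mean; `ridge` is a
    hypothetical stabiliser so that the covariance matrix is invertible.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.atleast_2d(np.asarray(X_query, dtype=float))
    mean = X_train.mean(axis=0)
    cov = np.atleast_2d(np.cov(X_train, rowvar=False))
    cov_inv = np.linalg.inv(cov + ridge * np.eye(cov.shape[0]))
    diff = X_query - mean
    # Quadratic form diff_i^T * cov_inv * diff_i for every query row i.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))

# Synthetic descriptors: one query at the training mean, one far away.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
inside = X_train.mean(axis=0, keepdims=True)
outside = inside + 10.0
distances = mahalanobis_to_training(X_train, np.vstack([inside, outside]))
```

A large distance suggests that a query compound lies far from the training data in descriptor space, which is the kind of signal a distance-based domain-of-applicability estimate relies on.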
Notes
For some compounds, experimental values for both solubility and log D were available. For these compounds, we used log D predictions generated using a cross-validation procedure, so that each prediction was made by a log D model that had not been trained on the experimental log D value of the respective compound. This is necessary to avoid over-optimistic predictions.
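The cross-validation procedure described in this note can be sketched as follows. This is only an illustration under assumptions: a plain least-squares linear model stands in for the log D model (which is not specified here), the data are synthetic, and the function name `out_of_fold_predictions` and its parameters are hypothetical.

```python
import numpy as np

def out_of_fold_predictions(X, y, n_folds=5, seed=0):
    """Predict each sample with a model that never saw that sample's target.

    Assumption: ordinary least squares with an intercept is a stand-in
    for the actual log D model.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(order, n_folds)
    Xb = np.column_stack([X, np.ones(n)])  # append intercept column
    preds = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(order, held_out)
        # Fit on the training folds only, then predict the held-out fold.
        coef, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
        preds[held_out] = Xb[held_out] @ coef
    return preds

# Synthetic example: a noisy linear target over three descriptors.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
preds = out_of_fold_predictions(X, y)
```

Each entry of `preds` can then be used as an input feature for a downstream model without leaking the measured value of that compound into its own prediction.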
It has been suggested to use numeric criteria, such as the log probability of the predictive distribution, for this purpose. In our experience these criteria can be misleading, so we have not used them. In particular, log probability tends to favor over-optimistic models.
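For reference, the log probability criterion mentioned in this note can be computed as below when the predictive distribution is Gaussian, with a predicted mean and standard deviation per compound. This is a minimal sketch of the criterion itself, assuming Gaussian predictive distributions; it is not part of the evaluation used in the paper, and the function names are hypothetical.

```python
import math

def gaussian_log_density(y_true, mean, std):
    """Log of the Gaussian predictive density N(y_true | mean, std**2)."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (y_true - mean) / std
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * z * z

def mean_log_density(y_true, means, stds):
    """Average log predictive density over a set of predictions."""
    scores = [gaussian_log_density(y, m, s)
              for y, m, s in zip(y_true, means, stds)]
    return sum(scores) / len(scores)

# A prediction that hits the true value exactly with unit error bar.
score = gaussian_log_density(0.0, 0.0, 1.0)
```

The score combines a penalty for wide error bars (the logarithmic term) with a penalty for large standardized errors (the quadratic term), so a single number mixes accuracy and error bar width.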
Acknowledgements
The authors gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778) and DFG grant MU 987/4-1. We thank Vincent Schütz and Carsten Jahn for maintaining the PCADMET database, and Gilles Blanchard for implementing the random forest method as part of our machine learning toolbox.
Cite this article
Schroeter, T.S., Schwaighofer, A., Mika, S. et al. Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules. J Comput Aided Mol Des 21, 485–498 (2007). https://doi.org/10.1007/s10822-007-9125-z
DOI: https://doi.org/10.1007/s10822-007-9125-z