Abstract
Accurate predictions of acid dissociation constants are essential to rational molecular design in the pharmaceutical industry and elsewhere. There has been much interest in developing new machine learning methods that can produce fast and accurate \(pK_{a}\) predictions for arbitrary species, along with estimates of prediction uncertainty. Previously, as part of the SAMPL6 community-wide blind challenge, Bannan et al. approached the problem by using Gaussian process regression to predict microscopic \(pK_{a}\) values, from which macroscopic \(pK_{a}\) values can be analytically computed (Bannan et al. in J Comput-Aided Mol Des 32:1165–1177). While this method can make reasonably quick and accurate predictions using a small training set, accuracy was limited because the training set did not span a sufficiently broad range of chemical space (e.g., polyprotic acids). Here, to address this issue, we construct a deep Gaussian process (GP) model that can include more features without invoking the curse of dimensionality. We trained both a standard GP and a deep GP model using a database of approximately 3500 small molecules curated from public sources and filtered by similarity to the targets. We tested the models on both the SAMPL6 challenge and the more recent SAMPL7 challenge, which posed a similar problem: few ionizable sites and/or environments were shared between the test set and the previous training set. The results show that while the deep GP model made only minor improvements over the standard GP model for SAMPL6 predictions, it made significant improvements over the standard GP model in SAMPL7 macroscopic predictions, achieving a mean absolute error (MAE) of 1.5 \(pK_{a}\) units.
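As an illustration of how macroscopic \(pK_{a}\) values can be computed analytically from microscopic ones, the sketch below covers only the two simplest textbook cases: one protonated microstate losing a proton to several deprotonated microstates, and several protonated microstates converging on one deprotonated microstate. It follows from standard equilibrium thermodynamics and is not taken from the authors' code; the function names are hypothetical.

```python
import math


def macro_pka_one_to_many(micro_pkas):
    """One protonated microstate -> several deprotonated microstates.

    The macroscopic Ka is the sum of the microscopic Ka values, so
    pKa_macro = -log10(sum_i 10**(-pKa_i)).
    """
    if not micro_pkas:
        raise ValueError("need at least one microscopic pKa")
    ka_total = sum(10.0 ** (-pka) for pka in micro_pkas)
    return -math.log10(ka_total)


def macro_pka_many_to_one(micro_pkas):
    """Several protonated microstates -> one deprotonated microstate.

    The reciprocal macroscopic Ka is the sum of the reciprocal microscopic
    Ka values, so pKa_macro = log10(sum_i 10**(pKa_i)).
    """
    if not micro_pkas:
        raise ValueError("need at least one microscopic pKa")
    inv_ka_total = sum(10.0 ** pka for pka in micro_pkas)
    return math.log10(inv_ka_total)


if __name__ == "__main__":
    # Hypothetical microscopic pKa values, chosen only for illustration.
    print(macro_pka_one_to_many([4.0, 4.0]))
    print(macro_pka_many_to_one([4.0, 4.0]))
```

With two equivalent microscopic transitions, the macroscopic value shifts by \(\log_{10} 2\) relative to the microscopic one: downward in the first case and upward in the second. Macrostates with several microstates on both sides require the full population-weighted expression, which this sketch does not cover.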
References
Gleeson MP (2008) Generation of a set of simple, interpretable ADMET rules of thumb. J Med Chem 51:817–834
Manallack DT, Prankerd RJ, Yuriev E, Oprea TI, Chalmers DK (2013) The significance of acid/base properties in drug discovery. Chem Soc Rev 42:485–496
SAMPL Challenge. https://www.samplchallenges.org. Accessed 1 Aug 2021
Işık M, Bergazin TD, Fox T, Rizzi A, Chodera JD, Mobley DL (2020) Assessing the accuracy of octanol-water partition coefficient predictions in the SAMPL6 Part II log P challenge. J Comput-Aided Mol Des 34:1–36
Fraczkiewicz R, Lobell M, Goller AH, Krenz U, Schoenneis R, Clark RD, Hillisch A (2015) Best of both worlds: combining pharma data and state of the art modeling technology to improve in silico pKa prediction. J Chem Inf Model 55:389–397
Shields GC, Seybold PG (2013) Computational approaches for the prediction of pKa values. CRC Press, Boca Raton
Fraczkiewicz R (2013) In silico prediction of ionization. Elsevier, Amsterdam
Bannan CC, Mobley DL, Skillman AG (2018) SAMPL6 challenge results from \(pK_a\) predictions based on a general Gaussian process model. J Comput-Aided Mol Des 32:1165–1177
pKa-Prospector 1.1.5.1: OpenEye Scientific Software, Santa Fe, NM. http://www.eyesopen.com. Accessed 1 Aug 2021
Gunner MR, Murakami T, Rustenburg AS, Işık M, Chodera JD (2020) Standard state free energies, not pKas, are ideal for describing small molecule protonation and tautomeric states. J Comput-Aided Mol Des 34:1–13
Halgren TA (1996) Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem 17:490–519
Jakalian A, Jack DB, Bayly CI (2002) Fast, efficient generation of high-quality atomic charges. AM1-BCC model: II. Parameterization and validation. J Comput Chem 23:1623–1641
Wagner J et al. (2020) openforcefield/openforcefield: 0.8.0 virtual sites and bond interpolation. https://doi.org/10.5281/zenodo.4121930
Landrum G (2006) RDKit: Open-source cheminformatics
Software os cheminformatics software: molecular modeling software. OpenEye Scientific. http://www.eyesopen.com. Accessed 1 Aug 2021
Shrake A, Rupley JA (1973) Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J Mol Biol 79:351–371
Xing L, Glen RC, Clark RD (2003) Predicting pKa by molecular tree structured fingerprints and PLS. J Chem Inf Comput Sci 43:870–879
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754
GPy (2012) GPy: a Gaussian process framework in python. http://github.com/SheffieldML/GPy. Accessed 1 Aug 2021
Damianou A, Lawrence N (2013) Deep gaussian processes. In: Artificial intelligence and statistics, pp 207–215
Pedregosa F et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Duvenaud D (2014) The Kernel cookbook: advice on covariance functions. https://www.cs.toronto.edu/~duvenaud/cookbook. Accessed 1 Aug 2021
Yang Q, Li Y, Yang J-D, Liu Y, Zhang L, Luo S, Cheng J-P (2020) Holistic prediction of pKa in diverse solvents based on machine learning approach. Angew Chem 132(43):19444–19453
Raddi R, Voelz V (2021) pKa database for stacking Gaussian Processes to improve pKa predictions in the SAMPL7 challenge. ChemRxiv. https://doi.org/10.5281/zenodo.5027418
Sushko I et al (2011) Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. J Comput-Aided Mol Des 25:533–554
Wishart DS et al (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 46:D1074–D1082
Settimo L, Bellman K, Knegtel RM (2014) Comparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res 31:1082–1095
Titsias M (2009) Variational learning of inducing variables in sparse Gaussian processes. Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, PMLR 5:567–574.
Francisco KR, Varricchio C, Paniak TJ, Kozlowski MC, Brancale A, Ballatore C (2021) Structure property relationships of N-acylsulfonamides and related bioisosteres. Eur J Med Chem 218:113399
Caine BA, Bronzato M, Popelier PL (2019) Experiment stands corrected: accurate prediction of the aqueous pKa values of sulfonamide drugs using equilibrium bond lengths. Chem Sci 10:6368–6381
Nigam A, Pollice R, Hurley MFD, Hickman RJ, Aldeghi M, Yoshikawa N, Chithrananda S, Voelz VA, Aspuru-Guzik A (2021) Assigning confidence to molecular property prediction. Expert Opin Drug Discovery. https://doi.org/10.1080/17460441.2021.1925247
Acknowledgements
RMR and VAV are supported by National Institutes of Health Grant R01GM123296. We thank the National Institutes of Health for its support of the SAMPL project via R01GM124270 to David L. Mobley (UC Irvine).
Cite this article
Raddi, R.M., Voelz, V.A. Stacking Gaussian processes to improve \(pK_a\) predictions in the SAMPL7 challenge. J Comput Aided Mol Des 35, 953–961 (2021). https://doi.org/10.1007/s10822-021-00411-8
DOI: https://doi.org/10.1007/s10822-021-00411-8