Stacking Gaussian processes to improve \(pK_a\) predictions in the SAMPL7 challenge


Accurate predictions of acid dissociation constants are essential to rational molecular design in the pharmaceutical industry and elsewhere. There has been much interest in developing new machine learning methods that can produce fast and accurate \(pK_{a}\) predictions for arbitrary species, as well as estimates of prediction uncertainty. Previously, as part of the SAMPL6 community-wide blind challenge, Bannan et al. approached the problem by using Gaussian process regression to predict microscopic \(pK_{a}\)s, from which macroscopic \(pK_{a}\) values can be analytically computed (Bannan et al. in J Comput-Aided Mol Des 32:1165–1177). While this method can make reasonably quick and accurate predictions using a small training set, its accuracy was limited because the training set did not cover a sufficiently broad range of chemical space (e.g., polyprotic acids). Here, to address this issue, we construct a deep Gaussian process (GP) model that can include more features without invoking the curse of dimensionality. We trained both a standard GP and a deep GP model using a database of approximately 3500 small molecules curated from public sources and filtered by similarity to the targets. We tested the models on both the SAMPL6 challenge and the more recent SAMPL7 challenge, which posed a similar difficulty: the test set contained ionizable sites and/or environments that were poorly represented in the previous training set. The results show that while the deep GP model made only minor improvements over the standard GP model for SAMPL6 predictions, it made significant improvements over the standard GP model in SAMPL7 macroscopic predictions, achieving a mean absolute error (MAE) of 1.5 \(pK_{a}\) units.
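To make concrete the statement that macroscopic \(pK_{a}\) values follow analytically from microscopic ones, the sketch below works through the textbook case of a diprotic acid with two ionizable sites. It is an illustration only and is not taken from the authors' code; the function names `macro_pka_first` and `macro_pka_last` are hypothetical. It assumes the standard relations that microscopic equilibrium constants add when one protonated state can lose a proton at several sites, and that their reciprocals add when several microstates converge on a single deprotonated state.

```python
import math


def macro_pka_first(micro_pkas):
    """Macroscopic pKa when ONE protonated state can deprotonate at several sites.

    Standard relation (assumed here): K_macro = sum_i K_i,
    where K_i = 10**(-pKa_i) are the microscopic constants.
    """
    return -math.log10(sum(10.0 ** (-p) for p in micro_pkas))


def macro_pka_last(micro_pkas):
    """Macroscopic pKa when several microstates converge on ONE deprotonated state.

    Standard relation (assumed here): 1 / K_macro = sum_i 1 / K_i,
    so pKa_macro = log10(sum_i 10**pKa_i).
    """
    return math.log10(sum(10.0 ** p for p in micro_pkas))


# Hypothetical diprotic acid with two equivalent sites (values are illustrative).
first = macro_pka_first([4.0, 4.0])   # H2A -> HA(a) or HA(b)
second = macro_pka_last([5.0, 5.0])   # HA(a) or HA(b) -> A
print(round(first, 3), round(second, 3))
```

With two equivalent sites, the first macroscopic \(pK_{a}\) is shifted down by \(\log_{10} 2\) (about 0.30) relative to the microscopic value, and the second is shifted up by the same amount. In a workflow like the one described above, the microscopic inputs would be the GP predictions rather than fixed numbers.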



  1. Gleeson MP (2008) Generation of a set of simple, interpretable ADMET rules of thumb. J Med Chem 51:817–834

  2. Manallack DT, Prankerd RJ, Yuriev E, Oprea TI, Chalmers DK (2013) The significance of acid/base properties in drug discovery. Chem Soc Rev 42:485–496

  3. SAMPL Challenge. Accessed 1 Aug 2021

  4. Işık M, Bergazin TD, Fox T, Rizzi A, Chodera JD, Mobley DL (2020) Assessing the accuracy of octanol-water partition coefficient predictions in the SAMPL6 Part II log P challenge. J Comput-Aided Mol Des 34:1–36

  5. Fraczkiewicz R, Lobell M, Goller AH, Krenz U, Schoenneis R, Clark RD, Hillisch A (2015) Best of both worlds: combining pharma data and state of the art modeling technology to improve in silico pKa prediction. J Chem Inf Model 55:389–397

  6. Shields GC, Seybold PG (2013) Computational approaches for the prediction of pKa values. CRC Press, Boca Raton

  7. Fraczkiewicz R (2013) In silico prediction of ionization. Elsevier, Amsterdam

  8. Bannan CC, Mobley DL, Skillman AG (2018) SAMPL6 challenge results from \(pK_a\) predictions based on a general Gaussian process model. J Comput-Aided Mol Des 32:1165–1177

  9. pKa-Prospector OpenEye Scientific Software, Santa Fe, NM. Accessed 1 Aug 2021

  10. Gunner MR, Murakami T, Rustenburg AS, Işık M, Chodera JD (2020) Standard state free energies, not pKas, are ideal for describing small molecule protonation and tautomeric states. J Comput-Aided Mol Des 34:1–13

  11. Halgren TA (1996) Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem 17:490–519

  12. Jakalian A, Jack DB, Bayly CI (2002) Fast, efficient generation of high-quality atomic charges. AM1-BCC model: II. Parameterization and validation. J Comput Chem 23:1623–1641

  13. Wagner J et al. (2020) openforcefield/openforcefield: 0.8.0 virtual sites and bond interpolation.

  14. Landrum G (2006) RDKit: Open-source cheminformatics

  15. OpenEye Scientific Software. Cheminformatics software: molecular modeling software. OpenEye Scientific. Accessed 1 Aug 2021

  16. Shrake A, Rupley JA (1973) Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J Mol Biol 79:351–371

  17. Xing L, Glen RC, Clark RD (2003) Predicting pKa by molecular tree structured fingerprints and PLS. J Chem Inf Comput Sci 43:870–879

  18. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754

  19. GPy (2012) GPy: a Gaussian process framework in python. Accessed 1 Aug 2021

  20. Damianou A, Lawrence N (2013) Deep gaussian processes. In: Artificial intelligence and statistics, pp 207–215

  21. Pedregosa F et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

  22. Duvenaud D (2014) The Kernel cookbook: advice on covariance functions. Accessed 1 Aug 2021

  23. Yang Q, Li Y, Yang J-D, Liu Y, Zhang L, Luo S, Cheng J-P (2020) Holistic prediction of pKa in diverse solvents based on machine learning approach. Angew Chem 132(43):19444–19453

  24. Raddi R, Voelz V (2021) pKa database for stacking Gaussian Processes to improve pKa predictions in the SAMPL7 challenge. ChemRxiv.

  25. Sushko I et al (2011) Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. J Comput-Aided Mol Des 25:533–554

  26. Wishart DS et al (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 46:D1074–D1082

  27. Settimo L, Bellman K, Knegtel RM (2014) Comparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res 31:1082–1095

  28. Titsias M (2009) Variational learning of inducing variables in sparse Gaussian processes. Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, PMLR 5:567–574.

  29. Francisco KR, Varricchio C, Paniak TJ, Kozlowski MC, Brancale A, Ballatore C (2021) Structure property relationships of N-acylsulfonamides and related bioisosteres. Eur J Med Chem 218:113399

  30. Caine BA, Bronzato M, Popelier PL (2019) Experiment stands corrected: accurate prediction of the aqueous pKa values of sulfonamide drugs using equilibrium bond lengths. Chem Sci 10:6368–6381

  31. Nigam A, Pollice R, Hurley MFD, Hickman RJ, Aldeghi M, Yoshikawa N, Chithrananda S, Voelz VA, Aspuru-Guzik A (2021) Assigning confidence to molecular property prediction. Expert Opin Drug Discovery.


RMR and VAV are supported by National Institutes of Health Grant R01GM123296. We thank the National Institutes of Health for its support of the SAMPL project via R01GM124270 to David L. Mobley (UC Irvine).

Author information

Corresponding author

Correspondence to Vincent A. Voelz.

Cite this article

Raddi, R.M., Voelz, V.A. Stacking Gaussian processes to improve \(pK_a\) predictions in the SAMPL7 challenge. J Comput Aided Mol Des 35, 953–961 (2021).


  • SAMPL7 physical property prediction
  • Acid dissociation constants
  • Gaussian process models
  • Machine learning
  • Physicochemical properties
  • Computational drug design