Multi-channel GCN ensembled machine learning model for molecular aqueous solubility prediction on a clean dataset

  • Original Article
  • Molecular Diversity

Abstract

In this study, we constructed a new aqueous solubility dataset and a solubility regression model that ensembles a graph convolutional network (GCN) with conventional machine learning models. Aqueous solubility is a key physicochemical property of small molecules in drug discovery. Solubility prediction has been studied extensively over the past few decades, but many published models have a high root mean squared error (RMSE), and their datasets often contain salts and solubility values measured under different experimental conditions. We therefore constructed a clean dataset of 2609 compounds, which is small but contains only salt-free solubility records measured at the same temperature (25 °C). We applied a GCN to build an aqueous solubility prediction model. To enhance its performance, molecular MACCS key fingerprints and physicochemical descriptors were combined with the GCN to form a multi-channel model. We also built two machine learning models (support vector regression and gradient boosting decision tree) and ensembled them with the GCN model to improve the root mean squared error (RMSE = 0.665). Finally, comparative experiments showed that our framework achieved the best performance on the ESOL dataset (validation RMSE = 0.56, test RMSE = 0.44) and surpassed four established software tools in predicting the aqueous solubility of new compounds.
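As a rough illustration of how an ensemble of regressors can be scored by RMSE, the sketch below averages per-compound predictions from three models and compares the result with each model alone. The abstract does not say how the GCN, SVR and GBDT outputs are combined, so the unweighted mean used here is an assumption, and the function names (`rmse`, `ensemble_mean`) and all solubility values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def ensemble_mean(*model_preds):
    """Unweighted average of per-compound predictions across models.

    Assumption: simple averaging; the paper may combine models differently.
    """
    n = len(model_preds[0])
    if any(len(p) != n for p in model_preds):
        raise ValueError("all models must predict the same compounds")
    return [sum(vals) / len(vals) for vals in zip(*model_preds)]


# Hypothetical log-solubility values, made up for illustration only.
y_true = [-2.0, -3.5, -1.0, -4.2]
gcn_pred = [-2.4, -3.0, -1.3, -4.0]
svr_pred = [-1.8, -3.9, -0.6, -4.6]
gbdt_pred = [-2.1, -3.4, -1.2, -3.7]

ensemble_pred = ensemble_mean(gcn_pred, svr_pred, gbdt_pred)

for name, pred in [("GCN", gcn_pred), ("SVR", svr_pred),
                   ("GBDT", gbdt_pred), ("ensemble", ensemble_pred)]:
    print(f"{name}: RMSE = {rmse(y_true, pred):.3f}")
```

Averaging cannot make the RMSE worse than that of the weakest member, and it helps most when the individual models make partly uncorrelated errors, which is one plausible motivation for combining a graph model with descriptor-based models.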

Acknowledgements

This work was financially supported by the National Natural Science Foundation of China (Nos. 81973182, 81803370 and 82073704), the Natural Science Foundation of Jiangsu Province (No. BK20180559), the State Key Laboratory Innovation Research and Cultivation Fund (No. SKLNMZZCX201812), and the “Double World-classes” Construction Program of China Pharmaceutical University (No. CPU2018GF02).

Author information

Corresponding authors

Correspondence to Yadong Chen or Haichun Liu.

Ethics declarations

Conflict of interest

The authors declare no competing financial interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 228 kb)

About this article

Cite this article

Deng, C., Liang, L., Xing, G. et al. Multi-channel GCN ensembled machine learning model for molecular aqueous solubility prediction on a clean dataset. Mol Divers 27, 1023–1035 (2023). https://doi.org/10.1007/s11030-022-10465-x

  • DOI: https://doi.org/10.1007/s11030-022-10465-x
