Machine learning-based solubility prediction and methodology evaluation of active pharmaceutical ingredients in industrial crystallization


Solubility is widely regarded as a fundamental property of small-molecule drugs and drug candidates because it has a profound impact on the crystallization process. Solubility prediction, an alternative to experiments that can reduce waste and improve the efficiency of the crystallization process, has attracted increasing attention. However, many challenges remain. Herein, we used seven descriptors, chosen on the basis of an understanding of dissolution behavior, to establish two solubility prediction models with machine learning (ML) algorithms. The models were built from the solubility data of 120 active pharmaceutical ingredients (APIs) in ethanol, using random decision forests and an artificial neural network, with the data structure and model accuracy optimized. The ML models were then compared with traditional prediction methods, including the modified solubility equation and a quantitative structure-property relationship model. The ML models gave the highest accuracy on the testing set, indicating the best solubility prediction ability among the methods compared. Multiple linear regression and stepwise regression were used to further investigate the critical factors determining the solubility value. The results revealed that both the API properties and the solute-solvent interaction make a non-negligible contribution to the solubility value.
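As an illustration of the stepwise-regression idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a greedy forward selection driven by the gain in R², a hypothetical threshold `min_gain`, and purely synthetic data (120 samples, seven unnamed descriptors) in which only descriptors 0 and 3 influence the response. The real descriptors and solubility data are not reproduced here.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Multiple linear regression with intercept via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def r_squared(rows, y, coef):
    """Coefficient of determination of the fitted linear model."""
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def forward_stepwise(rows, y, min_gain=0.01):
    """Greedy forward selection: add the descriptor that raises R^2 the most,
    stopping when the best gain falls below min_gain (an assumed criterion)."""
    selected, best_r2 = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            sub = [[r[c] for c in cols] for r in rows]
            scores.append((r_squared(sub, y, fit_ols(sub, y)), j))
        r2, j = max(scores)
        if r2 - best_r2 < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2

# Synthetic stand-in data (assumption): 120 samples, 7 descriptors,
# with the response depending only on descriptors 0 and 3 plus small noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(7)] for _ in range(120)]
y = [2.0 * r[0] - 1.5 * r[3] + rng.gauss(0, 0.1) for r in X]

selected, r2 = forward_stepwise(X, y)
print("selected descriptors:", selected, "R^2:", round(r2, 3))
```

In this sketch the selection stops once no remaining descriptor improves R² by at least `min_gain`, so the descriptors that survive are the ones carrying most of the explanatory weight. A study on real data would more likely rely on a statistics package and a significance-based entry criterion.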




This work was financially supported by the National Natural Science Foundation of China (Grant No. 21938009).

Author information



Corresponding authors

Correspondence to Jingcai Cheng or Junbo Gong.

Cite this article

Ma, Y., Gao, Z., Shi, P. et al. Machine learning-based solubility prediction and methodology evaluation of active pharmaceutical ingredients in industrial crystallization. Front. Chem. Sci. Eng. (2021).


  • solubility prediction
  • machine learning
  • artificial neural network
  • random decision forests