1 Introduction

In the 5-year period 2016–2020, 228 drugs were approved by the FDA, mostly for the treatment of cancer, infectious/viral diseases, and neurological disorders [1,2,3,4,5,6,7]. Of these drugs, 74% are ‘small molecule’ new molecular entities (NMEs). Many of the NMEs are larger, more lipophilic, and possess more H–bond acceptors, compared to older drugs in the Lipinski ‘Rule of 5’ (Ro5) chemical space [8, 9]. NMEs outside the Lipinski space are often dubbed ‘beyond the Rule of 5’ (bRo5) drugs [9,10,11,12,13,14,15,16,17,18]. Size inflation is not the only physicochemical characteristic of the NMEs. Some new drugs are relatively small.

Generally, large molecules may increase pharmacokinetic (PK) risks due to low solubility, possibly low cell permeability, increased efflux, and elevated metabolism. During drug discovery/early development, strategies to mitigate some of the risks have included: (i) selecting molecules which can dynamically form intramolecular H-bonds (IMHB) to shield polar groups, (ii) shielding polar groups by bulky side chains or by N-methylation, and (iii) selecting molecules with flexible ring structures [14,15,16,17,18,19]. Flexible molecules with the potential to form IMHBs have been of particular interest, since these may possess enhanced solubility in water, by adopting hydrophilic ‘extended’ conformations, as well as facilitated permeability across cell membranes, by adopting hydrophobic ‘folded’ conformations [17,18,19].

Solubility plays a central role in the fuller understanding of the PK risks. Reliable and actionable in silico models to predict solubility of NMEs and of promising molecules not yet synthesized, could be a valuable contribution to risk assessment [13]. We started to address this topic in a series of in silico studies [20,21,22]; the present contribution is a continuation of that effort.

In a recent study to predict the intrinsic solubility, log10 S0, of four standardized external test sets of mostly druglike molecules [20], three methods were critically examined: (i) Yalkowsky General Solubility Equation (GSE) [23], (ii) Abraham Solvation Equation (ABSOLV) [24], and (iii) Breiman Random Forest regression (RFR) machine learning method [25]. RFR was found to be most accurate: for a highly-curated external test set of 100 druglike molecules with consistently-determined solubility values (average interlaboratory reproducibility, SDavg ~ 0.17 log10 unit), the strength of the prediction was indicated by the coefficient of determination, r2 = 0.64, and root-mean-square error, RMSE = 0.76 (log10 unit) [20]. However, the ‘black-box’ machine learning RFR method has some disadvantages: (i) it does not directly suggest how compounds could be altered to increase/decrease their solubility [26]; (ii) there is no obvious simple explicit equation to predict solubility which could be used in a spreadsheet calculation; (iii) the method ‘learns’ superlatively but ‘teaches’ tepidly. The linear ABSOLV model, based on Abraham’s five solvation descriptors [24, 27], yielded poorer statistics: r2 = 0.26 and RMSE = 1.10 for the same test set. The GSE, Eq. 1, was slightly less successful than ABSOLV (r2 = 0.20, RMSE = 1.13) [20]. Nevertheless, the simple classic GSE is particularly appealing since it requires no ‘training.’ Merely the melting point (mp in °C) and the calculated (or measured) octanol–water partition coefficient, log10 P, are required to predict solubility (in log molar units):

$$\log_{10} {\text{S}}_{0}^{{\text{GSE(classic)}}} = 0.5 - 1.0\,{\text{log}}_{10} \;P - 0.01(mp - 25)$$
(1)
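As a minimal sketch of how Eq. 1 can be evaluated (the function and argument names are illustrative assumptions, not taken from the cited work), the classic GSE needs only log10 P and the melting point:

```python
def gse_classic(log_p: float, mp_celsius: float) -> float:
    """Classic GSE (Eq. 1): predicted log10 S0 in log molar units.

    log_p      -- octanol-water partition coefficient (measured or calculated)
    mp_celsius -- melting point in degrees Celsius
    """
    # Fixed coefficients of Eq. 1: intercept 0.5, log P slope -1.0, mp slope -0.01
    return 0.5 - 1.0 * log_p - 0.01 * (mp_celsius - 25.0)


# Hypothetical compound with log P = 2.0 and mp = 125 °C
example = gse_classic(2.0, 125.0)  # 0.5 - 2.0 - 1.0 = -2.5
```

Because the three coefficients are fixed, this is the kind of expression that fits in a single spreadsheet cell, which is the practical appeal noted above.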

In a follow-up study [21], solubility prediction using the above three models was applied to large molecules (MW > 800 g·mol−1). The novel aim was to explore to what extent Ro5 molecules could be used to predict the log10 S0 of molecules from the bRo5 space. For an external test set of 31 large molecules, RFR predicted solubility (r2 = 0.37, RMSE = 1.07) better than the other two methods. The RFR results suggested that it was possible to develop a model trained on small Ro5 molecules to predict the solubility of large bRo5 molecules. Unfortunately, the ‘how’ was not explicitly obvious. Nevertheless, the RFR method could serve as a benchmark against which other more actionable models could be measured. Also, the study revealed that the traditional GSE systematically underpredicts solubility of poorly soluble (S0 < 50 µmol·L−1) large molecules and greatly overpredicts solubility of highly soluble large molecules. The regression analysis of the three coefficients in Eq. 1 (0.5, − 1.0, − 0.01), using data partitioned into small and large molecule sets, resulted in notable differences between the two sets of coefficients, particularly in the first two terms (solvation contributions): (i) the 0.5 intercept in Eq. 1 was found to be − 0.28 for small molecules and − 1.77 for large molecules, and (ii) the log10 P slope factor, − 1.0, changed from − 0.83 to − 0.40 in small to large molecules, respectively [21]. The ABSOLV equation (trained with small molecules) revealed a different pattern of large-molecule residuals from that of the GSE: the solvation equation underpredicted the solubility of every large molecule tested. This was especially evident for very flexible molecules (e.g., gramicidin A, bryamycin, and vancomycin). 
The principal components analysis of the solubility database used to train the models revealed an asymmetric distribution in the data, resembling the shape of a ‘comet’, with small molecules symmetrically occupying the ‘head’ and large molecules (MW > 800 g·mol–1) exclusively occupying the ‘tail.’ Two hallmarks of bRo5 chemical space reside in the tail [21]: large size and large number of H-bond acceptors (NHA).

The above study [21] and earlier investigations by Caron and coworkers [15,16,17,18, 28] suggested that the influence of flexibility of large molecules on their solubility and permeability characteristics could be substantial. The latter researchers recommended the use of the Kier Φ molecular flexibility index [29] in modeling the properties of bRo5 molecules.

In our most recent solubility prediction study of bRo5 drugs, we discovered a way to incorporate the Kier molecular flexibility index, Φ, plus the Abraham B descriptor (H-bond acceptor strength) into Yalkowsky’s classic GSE to improve its performance substantially [22]. The three coefficients in Eq. 1 were empirically determined as smooth functions of the sum descriptor, Φ + B. The modified equation was named the ‘Flexible-Acceptor’ model, GSE(Φ,B). It was trained with small (Ro5) molecules to predict the solubility of large (bRo5) molecules (not used in the training). With just three coefficients in Eq. 1, each defined as a three-parameter exponential function of Φ + B, the strength of prediction nearly matched that of the RFR machine learning method. The coefficient of log10 P (traditionally fixed at -1.0) changed smoothly from − 1.1 for rigid nonionizable molecules (Φ + B = 0) to − 0.39 for typically flexible (Φ ~ 20, B ~ 6) large molecules. The intercept (usually fixed at + 0.5) varied smoothly from + 1.9 for rigid small molecules to − 2.2 for flexible large molecules. The mp coefficient remained practically constant, slightly different from the traditional value (− 0.01) for most molecules. For a test set of 32 large molecules the GSE(Φ,B) predicted the intrinsic solubility with RMSE of 1.10 log unit, compared to 3.0 by GSE(classic), and 1.07 by RFR.

Since our last study, it was found that the GSE(Φ,B) appears to work well not only for large drugs, but across a wide range of sizes of molecules. This piqued our interest to direct the new solubility prediction equation to recently-approved drugs (2016–2020), which comprise both the bRo5 and Ro5 molecules. For comparison, the GSE(classic), ABSOLV, GSE(Φ,B), and RFR models were each applied to predict the intrinsic solubility of 72 new drugs, for which usable reported solubility values could be accessed, nearly 60% from FDA published New Drug Application (NDA) reports [1,2,3,4,5,6,7]. The method performances ranked: RFR ~ GSE(Φ,B) > ABSOLV > GSE(classic). The performance of the GSE(Φ,B) was almost as good as that of the RFR. However, when the GSE(Φ,B) and ABSOLV methods were averaged, the resulting consensus model slightly outperformed the RFR method.

2 Computational Methods and Data Sources

2.1 Thermodynamic Basis of the General Solubility Equation (GSE)

Yalkowsky and coworkers developed the General Solubility Equation (GSE), Eq. 1, to predict the solubility of liquid/solid nonelectrolytes (mostly industrial organic chemicals) in water [23, 30,31,32,33,34,35]. The thermodynamic basis of the equation posits that the dissolution of a crystalline substance in water comprises two main contributions: (a) crystal lattice effect (XTL), related to the energy needed to break down the lattice to form a hypothetical ‘supercooled liquid’ (SCL), and (b) solvation effect, related to the energy released as the SCL dissolves in water. The total solubility of the compound in water is the product of the above two contributions, which in logarithmic terms can be stated as the sum [33, 34]

$$\log_{10} S = \log_{10} S_{W}^{{\text{XTL}}} + \log_{10} S_{W}^{{\text{SCL}}}$$
(2)

2.1.1 Crystal Lattice Effect

The lattice contribution, \(\log_{10} S_{\text{W}}^{\text{XTL}} = -\Delta S_{m} \left( T_{m} - T \right) / \left( 2.303RT \right)\), arises from the application of the van’t Hoff equation, where ∆Sm (kJ·mol–1·K–1) is the standard molar entropy of phase transformation (melting), Tm is the melting point (K), and T is the absolute temperature (K). For many small organic compounds, ∆Sm ≈ 0.057 kJ·mol–1·K–1 [33, 34]. Since at 25 °C, 2.303 RT = 5.7 kJ·mol–1, Eq. 2 reduces to Eq. 3, where mp is the melting point in °C.

$$\log_{10} S = \log_{10} S_{W}^{{{\text{SCL}}}} - 0.01(mp - 25)$$
(3)
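As a worked check of the − 0.01 factor in Eq. 3, assuming R = 0.008314 kJ·mol–1·K–1 and T = 298.15 K (values not stated explicitly in the text), and noting that a temperature difference Tm − T in K equals mp − 25 in °C:

$$\frac{\Delta S_{m}}{2.303RT} \approx \frac{0.057}{2.303 \times 0.008314 \times 298.15}\;{\text{K}}^{-1} \approx \frac{0.057}{5.71}\;{\text{K}}^{-1} \approx 0.01\;{\text{K}}^{-1}$$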

2.1.2 Solvation Effect

Hansch and coworkers [36] demonstrated that log10 S of 156 simple liquid solutes correlated linearly with the octanol–water partition coefficients, \(\log_{10} P \approx \log_{10} \left( S_{\text{oct}}^{\text{liq}} / S_{\text{W}}^{\text{liq}} \right)\). This led to the approximation:

$$\log_{10} S_{W}^{{{\text{liq}}}} = \text{a}_{0} + \text{a}_{1} \log_{10} P$$
(4)

where \({\text{a}}_{0} = \log_{10} S_{{{\text{oct}}}}^{{{\text{liq}}}}\) (solubility of a liquid solute in octanol) and a1 ≈ − 1. For small alcohol, aromatic, and alkane solutes, the series-dependent a0 intercepts were determined as: + 0.93, + 0.34, − 0.25, respectively. The a1 slope factors varied less: − 1.1 (alcohols), − 1.0 (aromatics), and − 1.2 (alkanes).

Yalkowsky and coworkers surmised that \({\text{a}}_{0} = \log_{10} S_{{{\text{oct}}}}^{{{\text{SCL}}}}\) = 0.5 [23]. The entropy of mixing favors complete miscibility of the two liquids (liquid solute and octanol); i.e., the mole fraction = 0.5. Since the concentration of pure octanol is 6.32 mol·L−1, then \(\log_{10} S_{{{\text{oct}}}}^{{{\text{liq}}}}\) = log10 (6.32 × 0.5) = 0.5. With this approximation (and with a1 =  − 1), Eq. 4 substituted into Eq. 3 reduces to Eq. 1.

These fundamental considerations suggest that the traditional Eq. 1 could be adapted to compounds from the bRo5 chemical space, since Hansch’s research hinted that the three coefficients in Eq. 1 could be optimized to different classes of compounds. If the ‘supercooled liquid’ form of a large polar solute is not fully miscible with octanol, then the \(\log_{10} S_{{{\text{oct}}}}^{{{\text{SCL}}}}\) contribution could very well be a negative number. Hence, a large molecule with a decreased \(S_{{{\text{oct}}}}^{{{\text{SCL}}}}\) (due to decreased miscibility) is expected to have an increased \(S_{{\text{W}}}^{{{\text{SCL}}}}\). This, in effect, would lessen the contribution of lipophilicity to the predicted solubility.

2.2 ‘Flexible-Acceptor’ General Solubility Equation, GSE(Φ,B)

In our earlier investigation [22] it was found that molecular flexibility (Φ) [29] could be incorporated into a nonlinear variant of the GSE to produce a promising trainable model with improved accuracy in predicting the solubility of large molecules (MW > 800 g·mol–1). Further incremental improvements were achieved with an augmented second descriptor: Abraham’s H-bond basicity (B), a measure of H-bond acceptor potential [24, 27]. The derived GSE(Φ,B) has the general form, with the three c-coefficients treated as three-parameter exponential functions of (Φ + B):

$$\log_{10} S_{0}^{\text{GSE}({\varPhi} ,B)} = c_{0} + c_{1} \cdot \log_{10} \;P + c_{2} \cdot (mp - 25)/100$$
(5)
$${\text{c}}_{0} = {\text{b}}_{0} + {\text{b}}_{1} \exp ( - {\text{b}}_{2} \cdot ({\varPhi} + B))$$
(6)
$${\text{c}}_{1} = {\text{b}}_{3} + {\text{b}}_{4} \,\left[ {1 - \exp ( - {\text{b}}_{5} \cdot ({\varPhi} + B))} \right]$$
(7)
$${\text{c}}_{2} = {\text{b}}_{6} + {\text{b}}_{7} \,\left[ {1 - \exp \,( - {\text{b}}_{8} \cdot ({\varPhi} + B))} \right]$$
(8)

The c-coefficients at aggregated values of Φ + B were determined by partial least squares (PLS open-source package from https://cran.r-project.org/web/packages/pls) analysis of solubility data sorted on values of Φ + B and uniformly binned into groupings of 209–1384 points. The details of the PLS procedure have been already described [22]. Since our database of solubility values has increased in size since our last study and since the focus now is on new drugs rather than specifically on big drugs, a new set of b-constants was determined in the current investigation, using druglike molecules as the training set, but excluding new drugs from the training.

Kier [29] constructed (considering structural attributes such as counts of chains, rings, branches, and heavy atoms) the molecular flexibility index, Φ, as the product of the first- and second-order ‘kappa’ shape indices, ¹κ and ²κ, divided by the heavy atom count in the molecule. Here, values of Φ were calculated from the two kappa and the heavy atom count descriptors provided by Landrum’s RDKit open-source chemoinformatics library [37]. Table 1 lists these Φ values.
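The definition of Φ above can be sketched in a few lines. The helper names are illustrative, and the choice of `GraphDescriptors.Kappa1` and `GraphDescriptors.Kappa2` as the RDKit kappa descriptors is an assumption, since the text does not name the exact RDKit functions used:

```python
def kier_phi(kappa1: float, kappa2: float, n_heavy: int) -> float:
    """Kier flexibility index: (kappa1 * kappa2) / heavy-atom count."""
    if n_heavy <= 0:
        raise ValueError("heavy-atom count must be positive")
    return kappa1 * kappa2 / n_heavy


def phi_from_smiles(smiles: str) -> float:
    """Optional convenience wrapper; requires RDKit to be installed.

    Assumption: the kappa indices are RDKit's GraphDescriptors.Kappa1/Kappa2.
    """
    from rdkit import Chem
    from rdkit.Chem import GraphDescriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return kier_phi(
        GraphDescriptors.Kappa1(mol),
        GraphDescriptors.Kappa2(mol),
        mol.GetNumHeavyAtoms(),
    )
```

Keeping the arithmetic separate from the RDKit call makes it possible to recompute Φ from tabulated kappa values without a chemoinformatics toolkit.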

Table 1 Physicochemical properties of newly approved drugs (2016–2020)

2.3 Abraham Descriptors and the ABSOLV Linear Model

To account for the thermodynamics of solute transfer from one phase to another, Abraham [24, 27] introduced five solvation descriptors: A, B, Sπ, E, and V. Two of these constitute hydrogen bonding potential: A is the sum of H-bond acidity (donor strength) and B is the sum of H-bond basicity (acceptor strength) in the molecule. Sπ is the dipolarity/polarizability (subscripted so as not to confuse it with solubility), E is an excess molar refraction in units of (cm3·mol−1)/10, and V is the McGowan characteristic volume in units of (cm3·mol−1)/100. Since large molecules have a greater number of H-bond acceptors than small molecules [21], the Abraham B descriptor was selected to augment the Φ descriptor to further improve solubility prediction in the bRo5 chemical space [22]. Values of the Abraham descriptors were calculated from 2D structures using the ABSOLV algorithm [27] (cf., www.acdlabs.com) and are listed in Table 1 for the new drugs.

Abraham and Le [24] amended the ABSOLV model to predict intrinsic solubility (log molar):

$$\log_{10} S_{0}^{\text{ABSOLV}} = {\text{d}}_{0} + {\text{d}}_{1} A + {\text{d}}_{2} B + {\text{d}}_{3} S_{\pi } + {\text{d}}_{4} E + {\text{d}}_{5} V + {\text{d}}_{6} A \cdot B$$
(9)

The independent variables are the five solute descriptors, plus the cross product of the H-bond terms. The seven d-coefficients were determined by PLS regression, using the training set database, exclusive of the new drugs set.

2.4 Statistical Machine Learning Random Forest Regression (RFR) Model

The implementation of the RFR open-source ‘randomForest’ library for the R statistical software has been described in our earlier solubility prediction studies [20,21,22]. The version used was downloaded from https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm. The method works by constructing an ensemble of hundreds of decision trees employing about 200 RDKit-generated molecular descriptors. The same procedure was applied in the current study. The method was re-trained with the enlarged database, excluding the newly-approved drugs.

2.5 Sources of Solubility Data for the Test (New Drugs) and Training (Wiki-pS 0 Database) Sets

The annual mini-reviews of FDA drug approvals by Mullard [1,2,3,4,5] were convenient starting points to identify the new drugs and to begin the search for their solubility values. The data for the newly-approved drugs were wearisome to locate. Since the drugs are relatively new, there are not many journal publications reporting properties of the compounds. Most of the data were found in FDA documents. As part of the New Drug Application (NDA) process, the FDA Center for Drug Evaluation and Research (CDER, www.accessdata.fda.gov) publishes reports listing some properties of compounds under consideration (review documents: Product Quality, Quality Assessment, Multi-Discipline, Clinical Pharmacology and Biopharmaceutics, and Other). Unfortunately, sometimes the information about solubility is redacted in these reports. Other useful sources include Product Monographs, Highlights of Prescription Information, and Safety Data Sheets. Some solubility data were found in patents. The European Medicines Agency (EMA) publishes Assessment Reports. The Australian regulatory agency publishes Australian Public Assessment Reports (AUSPAR), as well as Australian Product Information documents. These potential sources of measured solubility data were searched with the ‘solub’ key.

Generally, there was virtually no experimental detail about the measurements in the published regulatory reports. Most of the reported solubility values are of drugs in water (SW), without mention of the saturation pH. The temperature was assumed to be 23 °C when not stated or when reported as ‘room temperature’ (Table 1). Given the dearth of experimental detail, it is a challenge to assess the quality of the reported measurements in most of the FDA/EMA/AUSPAR reports. Still, there are high quality data in some of the documents, where solubility measurements were published as a function of pH. Examples of some of these are presented below.

Of the 169 small-molecule NMEs approved in the 5-year period, quantitative solubility data were found for only 72 (98 measurements in total) [38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115]. The reported values were transformed into the intrinsic solubility scale, S0, using known (or predicted) pKa values, and adjusted to 25 °C [116] using the program pDISOL-X (in-ADME Research) [117,118,119,120,121,122]. Table 1 lists the normalized solubility data, along with the pKa values used in the data analysis.

The Wiki-pS0 (in-ADME Research) intrinsic aqueous solubility database of mostly druglike molecules (currently with 7190 deeply-curated entries) was used to train the ABSOLV, GSE(Φ,B) and RFR models. Several hundred values from the database have already been published [20,21,22, 116,117,118,119,120,121,122,123,124,125,126], and the entire database is currently being prepared for publication as a book. The newly-approved drugs were used as external test sets and were excluded from the training process.

The structures of the 72 new drugs considered here (along with the year of approval) are shown in the Appendix (Fig. 9). In dual-API drug products, each API was treated as a separate ‘drug’ in the data analysis.

2.6 Sources of Octanol–Water Partition Coefficients (log10 P) and Melting Points (mp)

Originally, mp and log10 P in Eq. 1 were taken to be experimental values. However, it has become a common practice to use calculated values, clogP, in place of measured log10 P. In this study, clogP values were in all cases calculated by the Wildman–Crippen sum of atomic contributions method in the open-source RDKit chemoinformatics library [37]. Experimental mp values were employed where available and were calculated otherwise [127]. Values of mp are difficult to predict accurately. Prediction studies suggest a root-mean-square error of about 35 °C. From this, the mp contribution to calculated log10 S could be uncertain by ~ 0.4 log10 unit (0.01 × 35 ≈ 0.35). Some uncertainty lingers even with tabulated experimental values, as it is sometimes unclear whether a particular mp value refers to a salt form or a free-acid/base form of the compound.

3 Results and Discussion

3.1 Data Reduction

For half of the new drugs, log10 S0 values in Table 1 were determined from reported SW values, using pDISOL-X. The program also calculated the pH of the saturated solution, as though the Henderson–Hasselbalch (HH) equation were valid. When aggregates/complexes form or when supersaturation persists in the suspension, the HH equation does not accurately predict the shape of the log10 S–pH curve [118,119,120,121,122]. There is no simple way to recognize such anomalies just from a single SW measurement.
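To illustrate the kind of conversion implied above, the following is a minimal sketch of the Henderson–Hasselbalch relationship for a monoprotic acid or base. It is not the pDISOL-X procedure: it assumes the saturation pH is already known, and it ignores salt precipitation, aggregation/complexation, supersaturation, and the temperature adjustment. The function name and the `kind` argument are illustrative assumptions.

```python
import math


def log_s0_from_log_s(log_s: float, ph: float, pka: float, kind: str) -> float:
    """Convert log10 S at a given pH into intrinsic log10 S0 (monoprotic HH case).

    kind -- "acid", "base", or "neutral" (non-ionizable)
    """
    if kind == "acid":
        # S = S0 * (1 + 10**(pH - pKa)): ionization increases above the pKa
        ionized_ratio = 10.0 ** (ph - pka)
    elif kind == "base":
        # S = S0 * (1 + 10**(pKa - pH)): ionization increases below the pKa
        ionized_ratio = 10.0 ** (pka - ph)
    elif kind == "neutral":
        ionized_ratio = 0.0
    else:
        raise ValueError("kind must be 'acid', 'base', or 'neutral'")
    return log_s - math.log10(1.0 + ionized_ratio)


# Hypothetical base with pKa 8.0 measured at pH 6.0: about two log units are
# attributable to the charged form, so log10 S0 is close to -5.0.
example = log_s0_from_log_s(-3.0, 6.0, 8.0, "base")
```

The example shows why a single water-solubility value is fragile: an error of one unit in the assumed saturation pH or pKa shifts the derived log10 S0 by nearly one log unit when the compound is mostly ionized.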

The remaining log10 S data were sourced at two or more values of pH, which generally allowed for more confident determinations of log10 S0. These ‘raw’ log10 S–pH measurements required further data reduction and normalization. For ionizable molecules, the pKa values are required for such analysis. In cases where measured pKa values could not be found, they were calculated using the ChemAxon MarvinSketch v5.3.7 program (ChemAxon Ltd., https://www.chemaxon.com), as indicated by italic values in Table 1. In a few cases, it was possible to determine pKa values directly in the analysis of the log10 S–pH profiles (bold values in Table 1).

Examples of quality experimental log10 S–pH profiles reported for some of the new drugs are shown in Fig. 1. Frames a–c are of bases (acalabrutinib, pexidartinib, upadacitinib); frame d is that of an acid (dolutegravir); frame e is that of an ampholyte (talazoparib). The data from these five drugs appeared to follow shapes predicted by the Henderson–Hasselbalch equation: it was possible not only to determine the best-fit log10 S0, but also the values of pKa (in five cases) and the pKsp (in two cases). When profiles deviate from expected shapes, it may be possible to assess (and to correct for) the degree to which the measurements may be supersaturated or if aggregates/complexes are forming [118,119,120,121,122]. Figure 1f (safinamide) shows such an example of anomaly, where at pH 4.5, the solubility is higher than that expected for a solution saturated in the free base. Since the solubility values at pH 1.2 and 4.5 are nearly the same, the suspension at pH 4.5 may have been supersaturated with respect to the charged form of the base during the measurement. Had only a single measurement been reported at pH 4.5, the intrinsic solubility might have been determined at an order of magnitude too high.

Fig. 1

New-drug examples of log10 S–pH profiles of good disposition. The solid red curves are the best fit to the measured data (circles), using the regression analysis program pDISOL-X. It was also possible to determine the pKa values in cases (a–e). The dashed curves were calculated using the Henderson–Hasselbalch equation, incorporating the pKa used and the refined log10 S0. In cases (b), (d), and (f), it was possible to determine the salt solubility products (Color figure online)

3.2 Comparing Properties of the Newly-Approved Drugs to Those in the Database Training Set

Figure 2 shows the distribution of intrinsic solubility values for the database training set and the NMEs test set. The new drugs on the average are nearly an order of magnitude less soluble. Figure 3 compares the properties used to evaluate whether a compound falls into Lipinski’s ‘Rule of 5’ chemical space. The lipophilicities (as indicated by clogP) of the new drugs on the average are nearly an order of magnitude higher than those of the older drugs (Fig. 3a). The mean molecular weight of the older drugs is just under 300 g·mol–1; it is 450 g·mol–1 for the new drugs (Fig. 3b). Whereas the distribution of H-bond donors is nearly the same in the two sets (Fig. 3c), the distribution of H-bond acceptors is quite different (Fig. 3d). On the average, the number of H-bond acceptors, NHA, is about 4 per molecule for older drugs and nearly 7 per molecule in the newly-approved drugs.

Fig. 2

Distribution of intrinsic solubility values, log10 S0, in the training database set (green upper trace, scaled to the left vertical axis) and the test set (red lower trace, scaled to the right vertical axis) of newly approved drugs. The bell-shaped curves illustrate that the newly-approved drugs are about an order of magnitude less soluble (mean S0 = 79 µmol·L−1) than the training-set molecules (mean S0 = 631 µmol·L−1) (Color figure online)

Fig. 3

Lipinski’s Ro5 property distributions: a clogP (RDKit-calculated log10 P, Wildman–Crippen type), b molecular weight (MW), c number of H-bond donors (NHD), and d number of H-bond acceptors (NHA). The training set database distributions are in the upper traces (with counts scaled to the left vertical axis) and the test set new drug distributions are in the lower traces (with counts scaled to the right vertical axis). On the average, compared to the training-set molecules, the newly-approved drugs are about 1.4 times more lipophilic, have greater molecular weights by about 1.5 times, have similar distributions of H-bond donors (2/molecule), but have higher numbers of H-bond acceptors (7/molecule, compared to 4/molecule in the training set) (Color figure online)

The new drugs violate the boundary conditions of Lipinski’s Ro5 more often than those in the training set. There are relatively more molecules with clogP > 5 in the new drugs set (15% of NMEs), compared to that of the training set (6%). For the new drugs, 28% of the substances have MW > 500 g·mol–1, compared to 7% in the training set. The relative number of NHD > 5 in the new drugs set (5%) is higher than in the training set molecules (2%). The relative number of NHA > 10 for the new drugs (9%) is greater than in the training set (3%).

The distributions of Φ values for the training and test sets are shown in Fig. 4. On the average, the new drugs are more flexible (mean Φ = 6.2) than the molecules in the training set (mean Φ = 4.3). The training set spans a wide range of Φ values, from 0.4 to 43. The newly-approved drugs fall within that space, with Φ values ranging from 1.9 to 24.

Fig. 4

Distribution of the Kier flexibility index, Φ, in the training database set (upper trace, left axis scale) and the test set (lower trace, right axis scale) of newly approved drugs. The index was calculated using RDKit ‘kappa’ descriptors (see text). The newly-approved drugs are more flexible than those in the training set (6.2 vs. 4.3) and span a narrower range of values

3.3 Determination of the Three GSE Coefficients from Training Set iso-(Φ + B) Bins

The training set solubility data were sorted by Φ + B into ten bins of increasing values. For a narrow range of Φ + B values in each bin, the three GSE coefficients in Eq. 5 were determined by linear PLS regression, in the way that Hansch et al. [36] had trained the GSE for different chemical classes of compounds. Table 2 lists the set of determined c-constants for each of the bins. The resultant c-constants are depicted by the points on the three curves in Fig. 5, displayed as a function of the average values of Φ + B from each bin. It is possible to recognize trends for the substantially decreasing c0, the steadily increasing c1, and the very slightly increasing c2 coefficients with increasing values of Φ + B. Apparently, crystal lattice contributions are not appreciably affected by molecular flexibility and H-bond acceptor character, and trend near the traditional value (− 0.01) in Eq. 1. Evidently, the dependence of solubility on flexibility and H-bond acceptor strength is mediated by solution-phase interactions [26].
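The bin-wise procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' workflow: ordinary least squares (solved through the normal equations) is swapped in for the PLS regression from the R `pls` package, the bins are split by equal counts, whereas the bins in the study held 209–1384 points, and all function names are hypothetical. Each data row is a tuple `(phi_plus_b, clogp, mp, log_s0)`.

```python
def solve_linear(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_bin(rows):
    """Least-squares c0, c1, c2 of Eq. 5 for the rows of one bin."""
    xs = [(1.0, clogp, (mp - 25.0) / 100.0) for _, clogp, mp, _ in rows]
    ys = [row[3] for row in rows]
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(3)] for i in range(3)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(3)]
    return solve_linear(xtx, xty)


def fit_by_bins(rows, n_bins=10):
    """Sort rows on (phi + B), split into n_bins groups, fit Eq. 5 in each.

    Returns one tuple (mean phi + B, c0, c1, c2) per bin.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    size = len(ordered) // n_bins
    if size == 0:
        raise ValueError("fewer rows than bins")
    results = []
    for k in range(n_bins):
        start = k * size
        stop = (k + 1) * size if k < n_bins - 1 else len(ordered)
        chunk = ordered[start:stop]
        mean_x = sum(r[0] for r in chunk) / len(chunk)
        results.append((mean_x, *fit_bin(chunk)))
    return results
```

The per-bin tuples returned by `fit_by_bins` correspond to the points that are subsequently fitted to smooth exponential functions of Φ + B.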

Table 2 PLS analysis in bins ordered by Φ + B: \(\log_{10} S_{{0}}^{\text{GSE}}\) = c0 + c1 clogP + c2 (mp − 25)/100

Fig. 5

Training the Flexible-Acceptor GSE(Φ,B) model. The solubility data in the training set were sorted on Φ + B and then divided into ten practically constant (Φ + B) bins (Table 2). On the average, each bin contained about 700 log10 S0 measurements. For each bin (represented by a point in the plots), the three constants in Eq. 5 were determined by PLS regression to best fit the intra-bin solubility data. The aggregate intercept coefficient, c0(Φ,B) in Eq. 6, is described by an exponential decay function spanning from + 1.1 (rigid, octanol-miscible molecules) to − 3.4 (flexible, octanol-immiscible molecules). In the next two frames, the coefficient functions, c1(Φ,B) in Eq. 7 and c2(Φ,B) in Eq. 8, are depicted by exponential rises to the limiting values − 0.27 (= − 1.369 + 1.099) and − 0.52 (= − 1.128 + 0.608), respectively

From thermodynamic considerations, the c0 coefficient may be viewed as a measure of the solubility of the ‘supercooled’ liquid solute in octanol (c0 ≈ \(\log_{10} S_{{{\text{oct}}}}^{{{\text{sliq}}}}\)). Increasingly flexible molecules with strong H-bond acceptor character appear to be less miscible with octanol, as suggested by the decreasing c0 coefficients with increasing Φ + B (cf., Table 2 and Fig. 5). Between bins 1 and 10, \(S_{{{\text{oct}}}}^{{{\text{sliq}}}}\) decreases by four orders of magnitude. Given that the c1 coefficient also changes with Φ + B, the precise thermodynamic interpretation of the c0 coefficient is less clear than in the classical derivation [23, 33, 34] where c1 is constant.

The points in Fig. 5 were fitted to exponential forms as functions of Φ + B to determine the b-parameters (Eqs. 6–8), using standard nonlinear least-squares methods. The resultant best-fit curves in Fig. 5 define the aggregated form of the GSE(Φ,B), in the final form with the nine b-parameters determined, as shown below.

$${\text{c}}_{0} = - 3.464 + 5.431\exp ( - 0.1228 \cdot (\varPhi + B))$$
(10)
$${\text{c}}_{1} = - 1.369 + 1.099\left[ {1 - \exp ( - 0.1343 \cdot (\varPhi + B))} \right]$$
(11)
$${\text{c}}_{2} = - 1.128 + 0.608\left[ {1 - \exp ( - 0.247 \cdot (\varPhi + B))} \right]$$
(12)
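A minimal sketch of the trained GSE(Φ,B), combining Eq. 5 with the fitted coefficient functions, is given below. The leading constants are taken as negative (− 3.464, − 1.369, − 1.128) so that the limiting values agree with those quoted in the Fig. 5 caption (− 0.27 = − 1.369 + 1.099 and − 0.52 = − 1.128 + 0.608); the function names are illustrative assumptions.

```python
import math


def gse_phi_b_coefficients(phi: float, b: float):
    """Return (c0, c1, c2) of Eq. 5 as functions of the sum descriptor phi + B."""
    x = phi + b
    c0 = -3.464 + 5.431 * math.exp(-0.1228 * x)          # exponential decay
    c1 = -1.369 + 1.099 * (1.0 - math.exp(-0.1343 * x))  # rise toward -0.27
    c2 = -1.128 + 0.608 * (1.0 - math.exp(-0.247 * x))   # rise toward -0.52
    return c0, c1, c2


def gse_phi_b(log_p: float, mp_celsius: float, phi: float, b: float) -> float:
    """Flexible-Acceptor GSE (Eq. 5): predicted log10 S0 in log molar units."""
    c0, c1, c2 = gse_phi_b_coefficients(phi, b)
    # Note the /100 scaling of the melting-point term in Eq. 5
    return c0 + c1 * log_p + c2 * (mp_celsius - 25.0) / 100.0
```

For a rigid molecule with Φ + B = 0 the coefficients reduce to roughly (+ 1.97, − 1.37, − 1.13), while for very large Φ + B they approach (− 3.46, − 0.27, − 0.52), which reproduces the weakened lipophilicity dependence described for flexible molecules.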

3.4 ABSOLV Training

The d-coefficients in Eq. 9 were determined by PLS regression using the log10 S0 values from the Wiki-pS0 database, excluding those of the NMEs: r2 = 0.65, RMSE = 1.16, n = 7092.

$$\log_{10} S_{0}^{\text{ABSOLV}} = - 0.640 + 0.128A + 1.751B + 0.083S_{\pi } - 1.526E - 1.223V + 0.065A \cdot B$$
(13)
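Equation 13 is a plain linear expression and can be sketched directly; the function and argument names are illustrative, and the descriptor values themselves would come from the ABSOLV algorithm (Table 1), which is not reproduced here.

```python
def absolv_log_s0(a: float, b: float, s_pi: float, e: float, v: float) -> float:
    """ABSOLV prediction of log10 S0 (log molar) with the coefficients of Eq. 13.

    a, b -- Abraham H-bond acidity and basicity
    s_pi -- dipolarity/polarizability
    e    -- excess molar refraction
    v    -- McGowan characteristic volume
    """
    return (
        -0.640
        + 0.128 * a
        + 1.751 * b
        + 0.083 * s_pi
        - 1.526 * e
        - 1.223 * v
        + 0.065 * a * b  # cross product of the two H-bond terms
    )
```

With all descriptors set to zero the function returns the intercept, − 0.640, which is a convenient sanity check when transcribing the coefficients.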

3.5 Solubility Prediction Results for the Newly-Approved Drugs

3.5.1 Model Training

Figure 6 shows the results of the training of the four models, as measured log10 S0 vs. calculated log10 S0 correlation plots. The solid diagonals are identity lines. The dashed diagonals are ± 0.5 log10 unit displaced from the identity lines. The measure of prediction performance (MPP) is indicated by the pie-charts as the percentage of predicted values that are within ± 0.5 log10 unit of the observed values [128]. In the first three frames, the symbols represent the predominant charge states of molecules at pH 7.4: black diamonds represent uncharged molecules, blue squares represent bases (positively charged), red circles represent acids (negatively charged), and yellow diamonds represent zwitterions. The zwitterions are less well predicted in the GSE model, compared to the ABSOLV model [20]. The Random Forest Regression (RFR) internal validation was applied to a randomly-selected 30% of the database, based on training using the other 70% of the database (exclusive of new drugs). For molecules like those of the database, it is expected that their log10 S0 could be predicted with r2 = 0.90, RMSE = 0.62, with 76% of the molecules ‘correctly’ predicted (Fig. 6d). Generally, the other three methods are less precise, with MPP values ranging around half of the RFR value.
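The three statistics used to compare the models can be sketched as below. Two details are assumptions rather than statements from the text: r2 is computed as 1 − SSres/SStot, and a prediction lying exactly 0.5 log10 unit from the observed value is counted as ‘correct’. The function name is hypothetical.

```python
import math


def prediction_metrics(observed, predicted, tolerance=0.5):
    """Return RMSE, r2, and MPP (% of predictions within +/- tolerance)."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal length")
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    # r2 is undefined when all observed values are identical
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    mpp = 100.0 * sum(abs(r) <= tolerance for r in residuals) / n
    return {"rmse": rmse, "r2": r2, "mpp": mpp}


# Hypothetical log10 S0 values: three of the four predictions fall within 0.5
stats = prediction_metrics([-4.0, -5.0, -6.0, -7.0], [-4.2, -5.0, -6.9, -7.4])
```

MPP complements RMSE because it is insensitive to a few large outliers: a model can have a high RMSE driven by a handful of badly predicted compounds while still placing most predictions inside the ± 0.5 band.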

Fig. 6

Training set predictions of the four models considered: measured log10 S0 vs. calculated log10 S0. The solid diagonals are identity lines. The dashed diagonals are displaced ± 0.5 log10 unit from the identity lines. The pie-charts indicate the percentage of ‘correctly’ predicted values (see text). a GSE(classic) model, according to Eq. 1 (untrained). b ABSOLV model, Eq. 13, with coefficients determined by PLS regression. c Flexible-Acceptor GSE(Φ,B) model, according to Eqs. 10–12 (see text), and d Random Forest Regression (RFR) internal validation applied to a randomly-selected 30% of the database, trained using the other 70% of the database

3.5.2 Model Testing

Figure 7 shows the results of the predictions of the solubility of the newly-approved drugs (external test sets) by the four models. Table 3 summarizes the results. Briefly, the four results look similar, with MPP values ranging from 28 to 39%. Note that the horizontal scale in Fig. 7a is quite different from those of the other frames. The GSE(classic) underperformed compared to the other methods. The Flexible-Acceptor model produced prediction metrics nearly equal to those of RFR. None of the methods produced RMSE < 1, which may be indicative of the uncertain quality of the half of the new drugs' solubility data reported as single-point values in water. On the other hand, RFR uncharacteristically overpredicted the solubility of drugs with log10 S0 < − 7, which may hint that those molecules possessed structural features not found in the Wiki-pS0 training database. The GSE(Φ,B) shows similar overpredictions.

Fig. 7

Test set predictions of the four models considered: measured log S0 of newly-approved drugs vs. calculated log S0. See Fig. 6 caption for definitions of common features. a GSE(classic) model, according to Eq. 1 (untrained). b ABSOLV model, Eq. 13, with coefficients determined by PLS regression. c Flexible-Acceptor GSE(Φ,B) model, according to Eqs. 10–12, with (Φ + B)-dependent c-coefficient functions determined by PLS regression (see text), and d Random Forest Regression (RFR) external test set of newly approved drugs

Table 3 Predicted intrinsic solubility (log10 S0) of newly approved drugs (2016–2020)

The bias in the predicted results is near zero (− 0.06) in the RFR method. Both GSE methods show negative bias (− 0.33 and − 0.25), whereas the ABSOLV method produces a positive bias (+ 0.29). A consensus model was suggested by averaging the ABSOLV and GSE(Φ,B), to minimize the method bias. Figure 8 shows the results of the consensus model. Although the r2 (0.67) and RMSE (1.07) values in the consensus method match those of the RFR method, the MPP value (40%) and the bias (+ 0.02) in the consensus model are slight improvements.
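
The bias cancellation follows from simple arithmetic: the bias of an average of two predictions is the average of the two biases. If − 0.25 is the GSE(Φ,B) value, then (− 0.25 + 0.29)/2 = + 0.02, which agrees with the consensus bias quoted above. The sketch below illustrates this with made-up numbers. The function names, the data, and the sign convention (bias taken as the mean of calculated minus observed) are assumptions for illustration only.

```python
def bias(observed, calculated):
    """Mean signed error. Assumed convention: calculated minus observed."""
    return sum(c - o for o, c in zip(observed, calculated)) / len(observed)

def consensus(pred_a, pred_b):
    """Element-wise average of two sets of predictions."""
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]

# Hypothetical log10 S0 values (illustration only)
obs    = [-3.0, -5.0, -7.0]
absolv = [-2.6, -4.8, -6.7]   # tends to predict too high
gse_fb = [-3.3, -5.2, -7.3]   # tends to predict too low

avg = consensus(absolv, gse_fb)
print(f"bias ABSOLV    = {bias(obs, absolv):+.2f}")
print(f"bias GSE(F,B)  = {bias(obs, gse_fb):+.2f}")
print(f"bias consensus = {bias(obs, avg):+.2f}")
```

Averaging reduces the bias only when the two methods err in opposite directions, as ABSOLV and GSE(Φ,B) do here; it does not by itself guarantee a lower RMSE.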

Fig. 8

Consensus model: average of GSE(Φ,B) and ABSOLV model predictions applied to newly-approved drugs

3.5.3 More is Needed than Just Increasing the Size of the Training Set

The Wiki-pS0 database of druglike molecules has steadily grown over the last 10 years. Lately, it has been our observation that this growth alone has not proportionately improved its ability to predict the solubility of drugs. Metrics such as those in Fig. 6 have remained largely unchanged [20,21,22]. Solubility prediction depends on multi-dimensional factors (quality of measurements of both training and test sets, distribution of training set molecules in chemical space in relation to the tested drugs, sensitivity of descriptors used in prediction models, etc.), with some factors yet to be recognized. Simply increasing the size of the solubility training set may not lead to improved predictions. Lipinski has suggested that compiling a large physicochemical property database aimed at maximizing chemical diversity may be an inefficient strategy for predicting the properties of novel molecules, given the enormous size of the chemical space, and since drugs appear to exist there as small tight clusters [129]. However, small improvements in solubility prediction can be expected as the training set regularly acquires additional measurements of newly-approved molecules, i.e., as it draws from the “tight cluster” space. It would be helpful if the quality of such measurements were to improve with time. New descriptors which can better differentiate the factors affecting solubility may also be important for narrowing the gap between the accuracy of the prediction models and that of the experimental data.

4 Conclusion

Many of the new drugs are large and fall outside of the Lipinski Ro5 chemical space, as depicted in Fig. 3. It would have been helpful to have access to more quantitative solubility measurements of the newly-approved drugs than provided in the regulatory agency reports. The experimental uncertainty of nearly half of the new measurements could not be directly verified. If better practices in solubility measurement were adhered to, as detailed in the recent data-quality ‘white paper’ by experts from six countries [121], and the experimental details were more openly shared, newly-reported measurements could achieve results with interlaboratory SD < 0.2 log10 unit. But apparently this is still work in progress. The data quality in the curated database (SD < 0.2 log unit) used here as the training set is not the limiting factor in prediction, given that the best root-mean-square error achieved in this study was above a log unit. The benchmark statistical machine learning approaches are probably up to the task of narrowing the gap between prediction and measurement. The Flexible-Acceptor GSE(Φ,B) performed nearly as well as the benchmark Random Forest regression method in predicting the aqueous intrinsic solubility of the newly-approved drugs (2016–2020). A similar near-match had been previously reported by us in the prediction of the solubility of large (bRo5) drugs, supporting the general applicability of the Flexible-Acceptor model. A consensus model based on the average predictions of the ABSOLV and GSE(Φ,B) methods was found to reduce the prediction biases in the separate methods and, perhaps even more significantly, it slightly outperformed the Random Forest regression method overall. The relatively simple consensus model can be readily incorporated into spreadsheet calculations.