Introduction

Sugars are short-chain carbohydrates, with their molecule consisting of carbon (C), hydrogen (H) and oxygen (O) atoms with the general formula Cm(H2O)n where 2 ≤ m (and usually 3 ≤ m ≤ 7) and n ≤ m (and usually n = m or n = m − 1).

Monosaccharides are the simplest carbohydrates, with general formula (CH2O)n, where n usually ranges from 2 (diose, H–(C=O)–(CH2)–OH) to 7 (trioses for n = 3, tetroses for n = 4, pentoses for n = 5, hexoses for n = 6 and heptoses for n = 7). There are 24 monosaccharides (see Table 1) from diose (n = 2) to hexoses (n = 6). The monosaccharides with a lower number of atoms (e.g., n = 3 and n = 4) may cyclize by dimerization, leading to cyclic monosaccharides with n = 6 and n = 8, respectively; monosaccharides can also join together to form disaccharides. A disaccharide is formed whenever two monosaccharides (identical or not) are joined. Two identical monosaccharides can form up to eleven different disaccharides (Paulus and Klockow-Beck 1999), and the numbers increase even more abruptly when different monosaccharides are connected [Schmidt (1986) counted 720 trisaccharides, 34,560 tetrasaccharides and 2,144,640 pentasaccharides]. The consequence is an enormous diversity and complexity in carbohydrate structure and chemistry.

Table 1 Monosaccharides from diose to hexoses in open-chain (acyclic) form

In Table 1, ‘n = ’ stands for the n from the general formula (CH2O)n of the monosaccharides, while the split into aldoses and ketoses is based on the position of the double-bonded oxygen in their structure.

The solubilities of monosaccharides are reported in both experimental and theoretical studies (Banerjee 1996; Briciu et al. 2010; Sârbu and Briciu 2010; Kot-Wasik et al. 2014), but unfortunately under differing experimental conditions [18 °C or 20 °C (ChemicalBook 2017) or other temperatures; 10 mmHg, 18 mmHg or other barometric pressures (ChemSpider 2017) in place of the IUPAC SATP of 25 °C and 1 bar (McNaught and Wilkinson 1997); or different solvents or solvent mixtures (van Putten et al. 2014)].

In fact, very few experimental data on the solubilities of monosaccharides are reported at IUPAC SATP conditions. Gray et al. (2003) reported solubilities for five monosaccharides, and Teles et al. (2016) reported on the same five.

In addition, the recent literature abounds in studies on the solubility of monosaccharides, with growing interest not only in their solubility in water but also in water-based solvents and solvent mixtures. Thus, Ye et al. (2017) reported solubility data for butanol–water mixtures, while others reported data for ionic aqueous solvent mixtures including water + NaCl (Hernández-Luis et al. 2003; Ghalami-Choobar et al. 2015), water + NaBr (Zhuo et al. 2005), water + NaI (Zhuo et al. 2008), water + LiCl, water + KH2PO4 and water + NaC6H11O7 (Banipal et al. 2014a, b, 2015), water + 3-hydroxypropylammonium acetate (Singh et al. 2015), water + 1-hexyl-3-methyl imidazolium chloride (Zafarani-Moattar et al. 2017) and even for a series of ionic aqueous solvent mixtures (Carneiro et al. 2013). The solubility of other compounds in mixtures of water + monosaccharides is of interest as well; Nain (2016) reported data for water + d-mannose solvent mixtures. Recent solubility-related studies include solid–liquid and vapor–liquid phase equilibrium data for some monosaccharides as well as some disaccharides (Jónsdóttir et al. 2002) and solid–liquid phase equilibrium data for binary and ternary systems of certain monosaccharides and water (Guo et al. 2017).

The main problem in studies relating experimental measurements on carbohydrates is the scarcity of structural information, which results from combined factors [difficulties in crystallization and the limitations of NMR analysis (Zwahlen and Vincent 2002)]. Another challenge is that the researchers conducting the structural determinations are usually not the same as the ones conducting the property measurements. This reduces the reliability of the data sources, since during the experimental treatment the monosaccharides may very easily switch from the acyclic to the cyclic form, and the cyclic forms can undergo mutarotation.

Other data of interest reported for monosaccharides include the equilibrium constants of their complexes, but here too the available information is scarce: Hacket et al. (1997) report the equilibrium constants of complexes between β-cyclodextrin and three of the 24 monosaccharides listed in Table 1, clearly not enough paired data for an analysis in the series.

The increasing interest in series-based data including the monosaccharides is confirmed by a recent study of Buttersack (2017), which reports data not only for monosaccharides (most pentoses and all hexoses) but also for other carbohydrates and hydroxy compounds, estimating hydrophobicity by direct measurement of their hydrophobic interaction with a C18-modified silica gel column.

The solubilities at standard conditions for 8 of the 24 monosaccharides listed in Table 1 were used in this study to obtain relations expressing the solubility as a function of the structure of the monosaccharides in their acyclic form.

Material and method

The available data on the solubilities of 8 monosaccharides (listed in Table 2) were collected from the literature. For one (fructose) an extrapolation was necessary, while for two others (allose and psicose) a conversion of units was made.

Table 2 Experimental solubilities of monosaccharides in mole fraction (mol/mol) at IUPAC SATP conditions, along with identifiers of their chemical structure (PubChem CID)

In water, all forms exist, but only one is ‘unique’ (i.e., does not have different conformation states): the acyclic form, and this is the reason why it was used. If something in the structure explains its behavior in water, then it is present in all its forms, including the acyclic one. The advantage of using the acyclic form is its uniqueness, which allows the desired inference over the whole set of structures.

The structural information, as 3D geometries for the D-type isomers of the acyclic forms, was taken from the PubChem database (CID numbers of the files given in Table 2). For one monosaccharide (CID 111123, corresponding to D-idose), the 3D geometry was built from its 2D geometry. Properties were calculated on the 24 files containing the different geometries of monosaccharides using Spartan’14 software in the following configuration: energy calculation with the Hartree–Fock (HF) method, 6-31G* dual basis (Steele and Head-Gordon 2010); the infrared (IR) parameters (Pople et al. 1989) were computed too, and thermodynamic entities were derived (\(C_V\), molar heat capacity at constant volume; H, enthalpy; S, entropy; G, free enthalpy; \(C_V^{0}\), \(H^{0}\), \(S^{0}\) and \(G^{0}\) at 298.15 K; and \(S^{0K}\), \(C_V^{0K}\) and ZPE, the zero point energy, all at 0 K).

The extended characteristic polynomials were calculated on the CID files containing the structures listed in Table 2. Since the procedure of calculation for the extended polynomial is detailed elsewhere (Jäntschi and Bolboacă, 2016; Joiţa and Jäntschi, 2017), it is given here only in brief. The extended characteristic polynomial (ChPE) is calculated on the chemical structure containing only the heavy elements (without hydrogen atoms). For a molecule (Mol) and an evaluation point λ, ChPE depends on the choice of the atomic property (AP) and of the metric operator (MO), and is calculated from the identity (I) and the connectivity (C) matrices as the determinant (|∙|):

$${\text{ChPE}}\left( {A_{\text{P}} , \, M_{\text{O}} ,\lambda ,{\text{Mol}}} \right) \, = \, \left| {\lambda \cdot I\left( {A_{\text{P}} ,{\text{ Mol}}} \right) \, - \, C\left( {M_{\text{O}} ,{\text{ Mol}}} \right)} \right|$$
(1)

As for any other polynomial formula, the evaluation point λ (the argument of the polynomial) is used to evaluate the polynomial at a certain point of its domain; it may take any real value. Equation (1) resembles the classical definition of the characteristic polynomial (ChP = |λ∙[Id] − [Ad]|), in which the identity [Id] and adjacency [Ad] matrices of the graph of the molecule Mol are here changed into [I(AP, Mol)] and [C(MO, Mol)], as detailed in Joiţa and Jäntschi (2017).

For a given molecule (Mol), each choice of the atomic property (AP ∈ {‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’}) and of the metric operator (MO ∈ {‘c’, ‘C’, ‘g’, ‘G’, ‘t’, ‘T’}) provides a polynomial formula for the molecule.

The encodings for the atomic property (AP) specify the type of the element: the indistinguishable values of 1 on the main diagonal of the identity matrix [Id] are replaced with distinguishable values for each element as follows:

  • ‘A’ provides relative (to the last element of period 7, Uuo, A(Uuo) = 294) atomic mass;

  • ‘B’ provides the connection with the classical characteristic polynomial—always 1;

  • ‘C’—(atomic partial) charges, as available in the PubChem files;

  • ‘E’—electronegativity relative to Fluorine (4.00) on the Pauling (Pauling 1932) scale;

  • ‘F’—first ionization potential, relative to the potential of ionization for Hydrogen (1312 kJ/mol);

  • ‘G’—melting point temperature relative to diamond’s allotrope of Carbon (3820 K);

  • ‘H’—number of attached hydrogen atoms relative to the same for CH4 (4).

The encodings for the metric operator (MO) specify the bonds: the indistinguishable values of 1 from the adjacency matrix ([Ad]) for the {‘c’, ‘g’, ‘t’} alternatives and from the distance matrix ([Di]) for the {‘C’, ‘G’, ‘T’} alternatives are replaced with distinguishable values as follows:

  • Inverse (in Å−1) of the geometrical distances (in Å) replaces the 1's in [Ad] (when MO = ‘g’) and in [Di] (when MO = ‘G’);

  • Inverse of the bond order replaces the 1's in [Ad] (when MO = ‘c’) and in [Di] (when MO = ‘C’);

  • Provides the connection with the classical characteristic polynomial (on [Ad], when MO = ‘t’) and on its extension on the distance matrix (on [Di], using its inversed values, when MO = ‘T’).

Thus, when MO ∈ {‘c’, ‘g’, ‘t’}, Eq. (1) can be rewritten as ChPE(AP, MO, λ, Mol) = |λ∙Id(AP, Mol) − Ad(MO, Mol)|, and when MO ∈ {‘C’, ‘G’, ‘T’}, as ChPE(AP, MO, λ, Mol) = |λ∙Id(AP, Mol) − Di(MO, Mol)|, where Id(AP, Mol), Ad(MO, Mol) and Di(MO, Mol) are functions which replace the values of 1 in the identity ([Id]), adjacency ([Ad]) and distance ([Di]) matrices depending on the selected alternatives of AP and MO.
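As an illustration of Eq. (1), the following minimal Python sketch evaluates the determinant for user-supplied diagonal values and connectivity entries. The function names (`determinant`, `chpe`) and the three-atom chain are hypothetical and are not taken from the cited implementation. With all diagonal values equal to 1 and a plain adjacency matrix (AP = ‘B’, MO = ‘t’), the sketch reduces to the classical characteristic polynomial.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]  # work on a float copy
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap changes the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def chpe(lam, atom_values, connectivity):
    """Evaluate |lam * I(AP, Mol) - C(MO, Mol)| at the point lam.

    atom_values  -- values replacing the 1's on the identity diagonal
    connectivity -- square matrix standing for [Ad] or [Di]
    """
    n = len(atom_values)
    m = [[(lam * atom_values[i] if i == j else 0.0) - connectivity[i][j]
          for j in range(n)] for i in range(n)]
    return determinant(m)


# Hypothetical heavy-atom chain of three atoms, AP = 'B', MO = 't'.
ones = [1.0, 1.0, 1.0]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(chpe(0.5, ones, adjacency))
```

For this three-atom chain the classical characteristic polynomial is λ³ − 2λ, so the evaluation at λ = 0.5 gives −0.875.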

The extended characteristic polynomials can be computed for any value of the argument (λ), but, as for the replacements of the values of 1, only values from the [− 1, 1] range provide contraction mappings (Jäntschi et al. 2016). The polynomial was evaluated in 2001 equally spaced points from [− 1, 1], thus including − 1, 0 and 1 in the series of evaluation points. Each evaluated extended characteristic polynomial was named with eight characters, L1L2L3L4d1d2d3d4, where d1, d2, d3 and d4 are the digits of a number ranging from 0 to 1000 which gives the absolute value of the evaluation point in thousandths. The letters L1 to L4 have the following assignment: L4 is ‘N’ for negative λ arguments (from − 1.000 to − 0.001) and ‘P’ for nonnegative λ arguments (from 0.000 to 1.000), L3 encodes the connectivity (MO) alternatives, L2 encodes the identity (AP) alternatives, while L1 encodes a micro-linearization to macro-linearization alternative (‘I’ leaves the values unchanged, f(x) = x, ‘R’ provides reciprocal values, f(x) = 1/x, while ‘L’ provides the logarithm of the absolute values, f(x) = ln(|x|)). A total of 288,144 (2001∙8∙6∙3) possible value-based representations of the structure result in this way. For a set of molecules, each valid (non-null) representation provides a structure descriptor.
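The naming scheme can be sketched as a small decoder. The function and dictionary names below are hypothetical, and reading the four digits as thousandths of λ is inferred from the descriptor RGGN0477, which the text equates with 1/GG(− 0.477).

```python
import math

# Linearization alternatives encoded by the first letter (L1).
LINEARIZATIONS = {
    "I": lambda x: x,                 # identity, f(x) = x
    "R": lambda x: 1.0 / x,           # reciprocal, f(x) = 1/x
    "L": lambda x: math.log(abs(x)),  # logarithm, f(x) = ln(|x|)
}
ATOMIC_PROPERTIES = "ABCDEFGH"  # second letter (L2)
METRIC_OPERATORS = "cCgGtT"     # third letter (L3)


def decode_descriptor(name):
    """Split an eight-character descriptor name into its encoded parts."""
    if len(name) != 8:
        raise ValueError("a descriptor name has exactly eight characters")
    l1, l2, l3, l4, digits = name[0], name[1], name[2], name[3], name[4:]
    if l1 not in LINEARIZATIONS or l2 not in ATOMIC_PROPERTIES:
        raise ValueError("unknown linearization or atomic property letter")
    if l3 not in METRIC_OPERATORS or l4 not in "NP" or not digits.isdigit():
        raise ValueError("unknown metric operator, sign letter or digits")
    step = int(digits)
    # 'N' covers -1.000 .. -0.001, 'P' covers 0.000 .. 1.000.
    if step > 1000 or (l4 == "N" and step == 0):
        raise ValueError("evaluation point outside the encoded range")
    lam = (-step if l4 == "N" else step) / 1000
    return {"transform": l1, "atomic_property": l2,
            "metric_operator": l3, "lam": lam}


def count_representations(points=2001, properties=8, operators=6, transforms=3):
    """Number of possible value-based representations of a structure."""
    return points * properties * operators * transforms


print(decode_descriptor("RGGN0477"))
print(count_representations())
```

The second printed value reproduces the total of 288,144 representations quoted above.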

Two alternatives were considered: to obtain a relationship expressing the solubility from calculated properties (the ones calculated with the Spartan’14 software) and to obtain a relationship expressing the solubility from the evaluations of the extended characteristic polynomial.

Linear regressions were used, and the search was conducted with the following alternative models [Eq. 2 with one (x) descriptor, Eqs. 3 to 5 with two (x1 and x2) descriptors].

$$y \, \sim\hat{y} \, = \, a_{0} + \, a_{1} \cdot x\;{\text{or}}\;\hat{y} = a_{1} \cdot x$$
(2)
$$y\sim\hat{y} \, = a_{0} + \, a_{2} \cdot x_{1} \cdot x_{2} \;{\text{or}}\;\hat{y} \, = \, a_{2} \cdot x_{1} \cdot x_{2}$$
(3)
$$y\sim\hat{y} \, = a_{0} + a_{3} \cdot x_{1} + \, a_{4} \cdot x_{2} \;{\text{or}}\;\hat{y} = a_{3} \cdot x_{1} + a_{4} \cdot x_{2}$$
(4)
$$y\sim\hat{y} \, = a_{0} + \, a_{3} \cdot x_{1} + \, a_{4} \cdot x_{2} + a_{5} \cdot x_{1} \cdot x_{2} \;{\text{or}}\;\hat{y} = a_{3} \cdot x_{1} + a_{4} \cdot x_{2} + a_{5} \cdot x_{1} \cdot x_{2}$$
(5)

Equation (2) is a simple linear regression; Eq. (3) is a multiple (bivariate) linear regression with only multiplicative effects among descriptors; Eq. (4) is a multiple (bivariate) linear regression with only additive effects among descriptors; and Eq. (5) is a multiple (bivariate) linear regression accounting for both additive and multiplicative effects among descriptors.
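The eight model alternatives of Eqs. (2)–(5) differ only in the regressors they use, which the following sketch makes explicit. It assumes that the suffix ‘a’ denotes the variant with intercept (as stated for Eq. 2a in the discussion of Table 6) and ‘b’ the variant without it, and it fits the coefficients by ordinary least squares through the normal equations. The fitting procedure, the function names and the synthetic data are assumptions made for illustration only.

```python
def design_row(model, x1, x2=None):
    """Regressors of one observation for a model such as '2a' or '5b'."""
    equation, variant = model[0], model[1]
    if equation == "2":
        terms = [x1]                # simple linear regression
    elif equation == "3":
        terms = [x1 * x2]           # multiplicative effect only
    elif equation == "4":
        terms = [x1, x2]            # additive effects only
    elif equation == "5":
        terms = [x1, x2, x1 * x2]   # additive and multiplicative effects
    else:
        raise ValueError("unknown equation: " + model)
    return ([1.0] if variant == "a" else []) + terms


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(model, xs1, ys, xs2=None):
    """Least-squares coefficients, intercept first when the model has one."""
    if xs2 is None:
        xs2 = [None] * len(xs1)
    rows = [design_row(model, u, v) for u, v in zip(xs1, xs2)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic, noise-free data (not taken from the tables of this study).
xs1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
xs2 = [1.0, 0.0, 2.0, 1.0, 3.0, 0.0]
ys = [1.0 + 2.0 * u + 3.0 * v + 0.5 * u * v for u, v in zip(xs1, xs2)]
print(fit("5a", xs1, ys, xs2))
```

On these noise-free data the fit of type 5a recovers the generating coefficients (1, 2, 3 and 0.5) up to rounding error.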

Adjusted determination coefficients allow selection of the best explanatory models, and their Fisher Z transformations (Fisher 1915, 1921) allow comparison among them.
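For reference, the Fisher Z transformation of a correlation coefficient r is z = artanh(r) = ½ ln((1 + r)/(1 − r)). The sketch below also includes the textbook statistic for comparing two correlation coefficients, which assumes independent samples of sizes n1 and n2. Whether the comparison in this study used exactly this statistic is not stated, so the function `compare_correlations` should be read as an assumption.

```python
import math


def fisher_z(r):
    """Fisher Z transformation of a correlation coefficient (-1 < r < 1)."""
    return math.atanh(r)


def compare_correlations(r1, n1, r2, n2):
    """Approximate z statistic for the difference of two correlations.

    Assumes two independent samples; each transformed coefficient has an
    approximate standard error of 1 / sqrt(n - 3).
    """
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se


print(fisher_z(0.5))
print(compare_correlations(0.99, 8, 0.90, 8))
```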

Eqs. (2)–(5) can be used only if the dependent variable follows a normal distribution. Otherwise, a transformation of the data is required (Bolboacă and Jäntschi 2013).

Results and discussion

For the 24 files containing the different geometries of monosaccharides, the calculated properties are listed in Table 3 (Spartan’14 software, energy calculation with HF 6-31G* dual basis), and their associated thermodynamic quantities are listed in Table 4 (Spartan’14 software, computing IR parameters).

Table 3 Calculated molecular properties
Table 4 Calculated thermodynamic properties

Unfortunately, the solubilities listed in Table 2 depart somewhat from normality, as a histogram of the values readily shows (see Fig. 1).

Fig. 1 Histogram of the solubilities from Table 2

The logarithms of the solubilities approximate normality much better (see Fig. 2).

Fig. 2 Histogram of the logarithm of the solubilities from Table 2

Because very few data are available (only 8 values), it is difficult to test for normality. The alternatives are the Anderson–Darling (AD) and Kolmogorov–Smirnov (KS) tests (Bolboacă and Jäntschi 2009). The KS test gives a statistic of 0.24358 (p = 0.644) when solubility is tested for normality and of 0.14593 (p = 0.985) when ln(solubility) is tested for normality. The AD test gives a statistic of 0.56634 (p = 0.656) when solubility is tested for normality and of 0.2579 (p = 0.949) when ln(solubility) is tested for normality. The p values show that ln(solubility) is much closer to normality than the solubility. Therefore, the logarithm transformation was applied to the solubility.
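The KS statistic itself is straightforward to compute: it is the largest distance between the empirical distribution function of the sample and the normal distribution function. The sketch below assumes that the mean and the standard deviation of the normal distribution are estimated from the sample, and it does not reproduce the p values quoted above, whose computation is not described here. The sample values are illustrative and are not the solubilities of Table 2.

```python
import math
from statistics import mean, stdev


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample):
    """Kolmogorov-Smirnov distance between a sample and a fitted normal."""
    xs = sorted(sample)
    n = len(xs)
    mu, sigma = mean(xs), stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # The empirical CDF jumps from i/n to (i + 1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Illustrative right-skewed values (hypothetical, not from Table 2).
values = [0.02, 0.03, 0.05, 0.08, 0.12, 0.20, 0.30, 0.50]
print(ks_statistic(values))
print(ks_statistic([math.log(v) for v in values]))
```

A smaller statistic indicates a sample closer to the fitted normal distribution, which is the comparison made above between solubility and ln(solubility).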

Searching for regressions of the types given by Eqs. (2)–(5), with ln(solubility) (as well as with solubility) as the dependent variable and the whole pool of properties listed in Tables 3 and 4 as independent variables with potential explanatory power, was unsuccessful. Only a very poor association between ln(solubility) and SolvE (solvation energy) was identified (r = 0.32, r2 = 0.10, \(r_{\text{adj}}^{2}= 1-(1-r^2)(8-1)/(8-2)\) = − 0.05), definitely insufficient to be taken into account; only the intercept of this model is statistically significantly different from zero. A possible explanation is that the solubility (and its logarithm) is almost orthogonal to the other properties (the average correlation of ln(solubility) with the properties listed in Tables 3 and 4 is 0.05).
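The adjusted determination coefficient quoted above can be checked with a few lines of Python. The general form with p regressors is the standard one; the function name is hypothetical.

```python
def adjusted_r2(r2, n, p=1):
    """Adjusted determination coefficient for n observations, p regressors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# r2 = 0.10 with 8 observations and one descriptor, as for SolvE.
print(round(adjusted_r2(0.10, 8), 2))
```

With r2 = 0.10, n = 8 and one descriptor, the result is − 0.05: the adjustment for sample size removes all of the apparent explanatory power.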

Therefore, seeking relationships that estimate the solubility from the structure information provided by the extended characteristic polynomial seems appropriate and proved helpful. By applying Eqs. (2)–(5), the best explanatory structure-based descriptors were selected as the ones providing the best association under the models of the equations; their names and values for the compounds with known solubility are given in Table 5.

Table 5 Selected descriptors with highest explanatory power

In Table 5, Selection from column indicates which alternative (one of the equations: 2a, 2b, 3a, 3b, 4a, 4b, 5a, 5b) identified the descriptor as able to explain the ln(solubility).

As can be seen in Table 5, all selected descriptors came from the geometry-based approach [‘G’ letter in the third position, when the adjacency matrix ([Ad]) is replaced with the distance matrix ([Di]) in the former definition of the characteristic polynomial, |λ∙Id(AP, Mol) − Di(MO, Mol)|]. This may seem surprising, but it is not, since the distance matrix carries more information than the adjacency matrix (recall that the 1's are in the same positions in the distance matrix as in the adjacency matrix; the change is in the 0's, some of which are replaced with nonzero values based on the distances between the atoms). This selection of geometry as the predominant metric also explains why the properties derived from energetic calculations given in Tables 3 and 4 fail to estimate the solubility: many geometrical arrangements share the same energy. Regarding the atomic property, two alternatives were not selected among the best explanatory descriptors, the number of hydrogen atoms (‘H’) and the relative atomic mass (‘A’), which suggests that they have little influence on the solubility. This may seem not quite logical when we think about the dissociation process accompanying dissolution, but if each hydrogen is differently involved in this dissociation process, then it is reasonable that their number does not play an important role while the partial charges (‘C’) do.

Melting point temperature of the elements (‘G’) provides the first level of approximation for the association between the chemical structure and solubility (RGGN0477 = 1/GG(− 0.477) evaluated polynomial selected by Eq. 2a model in Table 5).

The split-effects models [Eq. (3), multiplicative, and Eq. (4), additive] both select the same two atomic properties as giving the best association: the topology of the molecule (‘B’) and the first ionization potential (‘F’). When both additive and multiplicative effects are taken into consideration [in Eq. (5)], the pair of explanatory atomic properties changes to electronegativity (‘E’) and partial charges (‘C’). Because the same pair of atomic properties was selected by both the additive and the multiplicative effects models, it seems reasonable to account for both effects, and therefore the model selected by Eq. (5) should be used for predictions [see the models listed in Table 6, where UV(λ) is the value of the extended characteristic polynomial UV in λ, with U encoding the atomic property (AP) and V encoding the distance metric (MO)].

Table 6 Models with the best explanatory power for solubility of monosaccharides

In Table 6, the Equation column indicates the type of the equation, one of the alternatives given as Eqs. (2)–(5). The Equation for ln(Solubility) column gives the selected polynomials (GG selected by the equation of type 2a, BG and FG selected by Eq. 3a, and so on) and the evaluation points (λ) which provide the best explanatory power for the model (λ = − 0.477 for the GG polynomial in Eq. 2a, λ = 0.987 for the BG and λ = 0.285 for the FG polynomial for the selections made using Eq. 3a, and so on). It also gives the meaning of the encodings in the names of the polynomials (the GG polynomial was obtained using the ‘G’ alternative for the atomic property, i.e., melting point temperature relative to the diamond allotrope of carbon, and the ‘G’ alternative for the metric operator, i.e., calculations made using the Cartesian coordinates from the geometric 3D model of the molecule), as well as the selected alternative for the connectivity ([C] = [Di], the distance matrix, in the case of the model obtained using the equation of type 2a). Finally, it gives the coefficients of the equations with their 95% confidence intervals; for example, \(- 2.35_{ \pm 0.24}\) for Eq. (2a) means that the intercept of the model of type 2a is − 2.35 and that, at a 5% risk of being in error, this value is subject to change by ± 0.24. The Statistics column gives the determination coefficient (r2) and its adjusted (for the sample size) counterpart (\(r_{\text{adj}}^{2}\)).

The equation listed as entry 5a in Table 6 was used for estimation (on the compounds with known solubility) and for prediction (for the rest of the compounds). The results are given in Table 7. As can be seen in Table 7, the estimated values are close to the measured values for the solubility. The greatest departure is at CID 90008 (corresponding to d-psicose), and it is below 2%, which is safely in the range of the experimental error of the measurements on relative scales (5%).

Table 7 Experimental, estimated and predicted solubilities of monosaccharides

In Table 7, the ln(Solubility) column contains the values of ln(Solubility) as estimated (for the first 8 monosaccharides) and, respectively, predicted (for the rest of them) by the equation listed as entry 5a in Table 6, while the Solubility column contains the exponential of those values (exp(ln(solubility)) = solubility), ready to be used.

The estimated values show a solubility of d-idose (CID 111123) and d-ribose (CID 5311110) similar to that of d-psicose (CID 90008), while d-dihydroxyacetone (CID 670) seems to have the highest solubility (about 0.41 mol/mol, see Table 7). According to the literature, at 20 °C (not at 25 °C), the solubility of d-dihydroxyacetone is greater than 930 g/l (SCCS 2010; more than 10 mol/l), thus supporting this estimate of its solubility at 25 °C. d-Altrose seems to have the second highest solubility among the selected monosaccharides (about 0.35 mol/mol, see Table 7), but unfortunately no experimental data are available for comparison.
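The conversion of the literature value from g/l to mol/l can be verified as follows. The sketch assumes the molecular formula C3H6O3 for dihydroxyacetone and standard atomic masses rounded to three decimals; the function names are hypothetical.

```python
# Standard atomic masses in g/mol, rounded to three decimals.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}


def molar_mass(composition):
    """Molar mass in g/mol from a mapping of element to atom count."""
    return sum(ATOMIC_MASS[element] * count
               for element, count in composition.items())


def grams_per_litre_to_mol_per_litre(g_per_l, composition):
    """Convert a mass concentration to a molar concentration."""
    return g_per_l / molar_mass(composition)


dihydroxyacetone = {"C": 3, "H": 6, "O": 3}  # C3H6O3, about 90.08 g/mol
print(round(grams_per_litre_to_mol_per_litre(930.0, dihydroxyacetone), 2))
```

The result is slightly above 10.3 mol/l, consistent with the "more than 10 mol/l" stated above. Converting further to a mole fraction would require the density of the solution, which is not given here.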

For glycolaldehyde (CID 756), d-erythrose (CID 94176), d-glyceraldehyde (CID 751), d-erythrulose (CID 5460177), d-threose (CID 439665), d-lyxose (CID 65550), d-ribulose (CID 151261) and d-xylulose (CID 5289590), the estimation provides a solubility similar to that of d-xylose (CID 644160), about 0.13 mol/mol, which is plausible since in water the monosaccharides undergo a complex process of mutarotation, leading to a series of different forms, as shown by Curtius et al. (1968).

Please note that in the absence of experimental data available, the results provided as predicted solubilities for monosaccharides (the last 16 entries in Table 7) are not validated. In order to be validated, further measurements on solubilities of monosaccharides must be conducted.

Conclusions

The experimental measurements of the solubilities of monosaccharides are scarce and rarely reported at standard conditions. Since solubility is of great importance for the biological role of the monosaccharides and the search for property–property relationships expressing the solubility was unsuccessful, a search for a structure–property relationship was conducted.

By using the extended characteristic polynomial, a structure–property relationship was found with a great capacity for estimating the solubility of the monosaccharides (\(r_{\text{adj}}^{2}\) = 0.9996). The relation suggests that the solubility of the monosaccharides is strongly dependent on the geometry, and that the atomic partial charges and the electronegativities of the elements play the main role in its expression. The relation was used to predict the solubility of 16 monosaccharides, and plausible solubilities were obtained.