Structure–property relationships for solubility of monosaccharides

A series of difficulties makes precise experimental determination of the solubilities of monosaccharides at standard conditions hard to achieve: in water, the monosaccharides may switch from the acyclic to the cyclic form, and the cyclic forms can undergo mutarotation. There are many ways to express structural information as structure descriptors, but the alternatives become fewer when looking for invariants with good identification abilities, and the characteristic polynomial is one of them. The disadvantage of the characteristic polynomial is that it is defined with disregard of the chemical information coming from the type of the element and the type of the bond. Here, an extension of the characteristic polynomial that accounts for this chemical information was used. In water, monosaccharides exist in all their forms, but only the acyclic form is unique for each compound. If something in the structure associates the structure information with the solubility, then it is present in all the forms, including the acyclic one, and therefore the acyclic forms can be used to derive structure–property relationships. A search for linear relationships expressing the solubility as a function of the structure of the acyclic forms of monosaccharides was conducted by using the extension of the characteristic polynomial. The search used the available experimental data to select models able to estimate the solubility, each differing from the others in the effects considered. The results show that the extended characteristic polynomial provides a very good estimation capability for the solubilities of monosaccharides.


Introduction
Sugars are short-chain carbohydrates whose molecules consist of carbon (C), hydrogen (H) and oxygen (O) atoms, with the general formula C_m(H2O)_n, where 2 ≤ m (usually 3 ≤ m ≤ 7) and n ≤ m (usually n = m or n = m − 1).
Monosaccharides are the simplest carbohydrates, with the general formula (CH2O)_n, where n ranges from 2 (diose, H-(C=O)-(CH2)-OH) to usually 7 (trioses for n = 3, tetroses for n = 4, pentoses for n = 5, hexoses for n = 6 and heptoses for n = 7). There are 24 monosaccharides (see Table 1) from diose (n = 2) to hexoses (n = 6). The monosaccharides with a lower number of atoms (e.g., n = 3 and n = 4) may cyclize by dimerization, leading to cyclic monosaccharides with n = 6 and n = 8, respectively; monosaccharides can also join together to form disaccharides. A disaccharide is formed whenever two monosaccharides (identical or not) are joined. Two identical monosaccharides can form up to eleven different disaccharides (Paulus and Klockow-Beck 1999), and the numbers increase even more abruptly when different monosaccharides are connected [Schmidt (1986) counted 720 trisaccharides, 34,560 tetrasaccharides and 2,144,640 pentasaccharides]. The consequence is an enormous diversity and complexity in carbohydrate structure and chemistry.
In Table 1, 'n = ' stands for the n in the general formula (CH2O)_n of the monosaccharides, while the split into aldoses and ketoses is based on the position of the double-bonded oxygen in their structure.
Very few experimental data on the solubilities of monosaccharides are reported at IUPAC SATP conditions. Gray et al. (2003) reported solubilities for five monosaccharides, while Teles et al. (2016) cover the same five.
In addition, the recent literature abounds in studies on the solubility of monosaccharides, with growing interest not only in their solubility in water, but also in water-based solvents and solvent mixtures. Thus, Ye et al. (2017) reported solubility data for butanol-water mixtures, while others reported data for ionic aqueous solvent mixtures including water + NaCl (Hernández-Luis et al. 2003; Ghalami-Choobar et al. 2015), water + NaBr (Zhuo et al. 2005), water + NaI (Zhuo et al. 2008), water + LiCl, water + KH2PO4 and water + NaC6H11O7 (Banipal et al. 2014a, b, 2015), water + 3-hydroxypropylammonium acetate, water + 1-hexyl-3-methyl imidazolium chloride (Zafarani-Moattar et al. 2017) and even a series of ionic aqueous solvent mixtures (Carneiro et al. 2013). The solubility of other compounds in mixtures of water + monosaccharides is of interest as well, with Nain (2016) reporting data for water + d-mannose solvent mixtures. Recent solubility-related studies include solid-liquid and vapor-liquid phase equilibrium data for some monosaccharides as well as some disaccharides (Jónsdóttir et al. 2002) and solid-liquid phase equilibrium data for binary and ternary systems of certain monosaccharides and water (Guo et al. 2017).
The main problem in studies that relate experimental measurements on carbohydrates to their structure is the scarcity of structural information, caused by combined factors: difficulties in crystallization and limitations of NMR analysis (Zwahlen and Vincent 2002). Another challenge is that the researchers conducting the structural determinations are usually not the same as those conducting the property measurements, which reduces the reliability of the data sources, since during the experimental treatment the monosaccharides may very easily switch from the acyclic to the cyclic form and the cyclic forms can undergo mutarotation.
Other data of interest reported for monosaccharides include the equilibrium constants of their complexes, but here too the available information is scarce: Hacket et al. (1997) report the equilibrium constants of complexes between β-cyclodextrin and three of the 24 monosaccharides listed in Table 1, clearly not enough paired data for an analysis in the series. The growing interest in series-based data that include monosaccharides is confirmed by a recent study of Buttersack (2017), which reports data not only for monosaccharides (most pentoses and all hexoses) and which estimated hydrophobicity by direct measurement of the hydrophobic interaction of carbohydrates and other hydroxy compounds with a C18-modified silica gel column.
Table 1 note: where applicable, the names are *ose for acyclic and *opyranose for cyclic forms (e.g., glucose: acyclic; glucopyranose: cyclic).
The solubilities at standard conditions of 8 of the 24 monosaccharides listed in Table 1 were used in this study to obtain relations expressing the solubility as a function of the structure of the monosaccharides in their acyclic form.

Material and method
The available solubility data for 8 monosaccharides (listed in Table 2) were collected from the literature. For one of them (fructose) an extrapolation was necessary, while for two others (allose and psicose) the units were converted.
In water, all forms exist, but only one, the acyclic form, is 'unique', i.e., it does not have different conformation states, and this is why it was used. If something in the structure explains the behavior in water, then it is present in all the forms, including the acyclic one. The advantage of the acyclic form is its uniqueness, which allows the desired inference over the whole set of structures.
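The unit conversion mentioned above can be sketched as follows. This is a minimal illustration, not the procedure actually used for Table 2: the function name is hypothetical, the molar masses are standard values, and whether the tabulated solubility is a mole ratio (mol solute per mol water) or a mole fraction is an assumption, so both are returned. The 47 wt% input is the figure quoted in the Table 2 notes for a hexose.

```python
def wt_percent_to_molar(wt_percent, m_solute, m_solvent=18.015):
    """Convert a solute mass percentage to (mole ratio, mole fraction).

    m_solute and m_solvent are molar masses in g/mol; the default
    solvent is water. Hypothetical helper, for illustration only.
    """
    if not 0.0 <= wt_percent < 100.0:
        raise ValueError("wt_percent must be in [0, 100)")
    n_solute = wt_percent / m_solute               # mol per 100 g of solution
    n_solvent = (100.0 - wt_percent) / m_solvent   # mol per 100 g of solution
    ratio = n_solute / n_solvent                   # mol solute / mol solvent
    fraction = n_solute / (n_solute + n_solvent)   # mole fraction of solute
    return ratio, fraction


M_HEXOSE = 180.156  # g/mol for C6H12O6 (allose and psicose are hexoses)
ratio, fraction = wt_percent_to_molar(47.0, M_HEXOSE)
print(f"mole ratio = {ratio:.4f}, mole fraction = {fraction:.4f}")
```

Both quantities are dimensionless, which is why the choice between them has to be read from the source table rather than inferred from the units.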
The structural information, as 3D geometries of the D-type isomers of the acyclic forms, was taken from the PubChem database (the CID numbers of the files are given in Table 2). For one monosaccharide (CID 111123, corresponding to D-idose), the 3D geometry was built from its 2D geometry. For the 24 files containing the different geometries of monosaccharides, properties were calculated with the Spartan'14 software in the following configuration: energy calculation with the Hartree-Fock (HF) method and the 6-31G* dual basis (Steele and Head-Gordon 2010); the infrared (IR) parameters (Pople et al. 1989) were computed too, and thermodynamic quantities were derived (C_V: molar heat capacity at constant volume, H: enthalpy, S: entropy, G: free enthalpy; C_V^0, H^0, S^0 and G^0 at 298.15 K, and S^0K, C_V^0K and ZPE (zero point energy), all at 0 K).
Table 2 notes: 1: from (Gray et al. 2003); 2: converted from (Kozakai et al. 2015), 47 wt%; 3: extrapolated at 79.3 wt% (from 30°, 40° and 50° to 25°) from (Flood et al. 1996); 4: converted from (Fukada et al. 2010).
The extended characteristic polynomials were calculated for the CID files containing the structures listed in Table 2. Since the calculation procedure for the extended polynomial is detailed elsewhere (Joiţa and Jäntschi 2017), it is given here only in brief. The extended characteristic polynomial (ChPE) is calculated on the chemical structure containing only the heavy elements (without hydrogen atoms). For a molecule (Mol) and an evaluation point λ, ChPE is calculated depending on the choice of the atomic property (A_P) and of the metric operator (M_O), by using the identity ([Id]) and the connectivity ([C]) matrices, as the determinant (|•|):
$$\mathrm{ChPE}(\lambda; A_P, M_O, \mathrm{Mol}) = \left| \lambda \cdot [\mathrm{Id}](A_P, \mathrm{Mol}) - [\mathrm{C}](M_O, \mathrm{Mol}) \right| \tag{1}$$
As for any other polynomial, the evaluation point λ (the argument of the polynomial) is used to evaluate the polynomial at a certain point of its domain; it may take any real value.
The encodings for the atomic property (A_P) provide specifications for the type of the element, when the undistinguishable values of 1 from the main diagonal of the identity matrix [Id] are replaced with distinguishable values for each element as follows:
• 'A': atomic mass relative to the last element of period 7 (Uuo, A(Uuo) = 294);
• 'B': always 1, which provides the connection with the classical characteristic polynomial;
• 'C': (atomic partial) charges, as available in the PubChem files;
• 'E': electronegativity relative to fluorine (4.00) on the Pauling (Pauling 1932) scale;
• 'F': first ionization potential, relative to the ionization potential of hydrogen (1312 kJ/mol);
• 'G': melting point temperature relative to the diamond allotrope of carbon (3820 K);
• 'H': number of attached hydrogen atoms relative to the same for CH4 (4).
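The determinant definition above, |λ·Id(A_P) − C(M_O)|, can be illustrated with a small pure-Python sketch. It is not the implementation used in the study: the function names, the toy three-atom chain and its adjacency matrix are assumptions made for illustration, and the connectivity matrix is simply passed in as given. With the 'B' encoding (all diagonal values equal to 1) and an adjacency matrix, the result reduces to the classical characteristic polynomial.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def chpe(lam, atom_values, connectivity):
    """Evaluate |lam * Id(A_P) - C(M_O)| for one molecule.

    atom_values: diagonal replacements for the chosen atomic property.
    connectivity: square matrix for the chosen metric operator.
    """
    n = len(atom_values)
    m = [[(lam * atom_values[i] if i == j else 0.0) - connectivity[i][j]
          for j in range(n)] for i in range(n)]
    return determinant(m)


# Toy heavy-atom chain C-C-O (hypothetical example, not a monosaccharide).
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
classical = chpe(0.5, [1.0, 1.0, 1.0], adjacency)        # 'B' encoding
pauling = [2.55 / 4.00, 2.55 / 4.00, 3.44 / 4.00]         # 'E' encoding
weighted = chpe(0.5, pauling, adjacency)
print(classical, weighted)
```

For the three-atom path, the classical characteristic polynomial is λ³ − 2λ, so the 'B' evaluation at λ = 0.5 can be checked by hand; the 'E' evaluation shows how the element type changes the value for the same topology.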
The encodings for the metric operator (M_O) provide specifications for the bonds, when the undistinguishable values of 1 from the connectivity matrix are replaced with distinguishable values. The extended characteristic polynomials can be computed for any value of the argument (λ), but, as for the replacements of the values of 1, only the values from the [−1, 1] range provide contraction mappings. The polynomial was evaluated at 2001 equally spaced points from [−1, 1], thus including −1, 0 and 1 in the series of evaluation points. The name of each evaluated extended characteristic polynomial has eight characters, L1 L2 L3 L4 d1 d2 d3 d4, where d1, d2, d3 and d4 are the digits of a number ranging from 0 to 1000 that identifies the equally spaced evaluation point. The letters L1 to L4 have the following assignment: L4 is 'N' for negative λ arguments (from −1.000 to −0.001) and 'P' for nonnegative λ arguments (from 0.000 to 1.000), L3 encodes the connectivity (M_O) alternatives, L2 encodes the identity (A_P) alternatives, while L1 encodes a micro-linearization to macro-linearization alternative ('I' leaves the values unchanged, f(x) = x; 'R' provides reciprocal values, f(x) = 1/x; 'L' provides the logarithm of the absolute values, f(x) = ln(|x|)). A total of 288,144 (2001•8•6•3) possible value-based representations of the structure results in this way. For a set of molecules, each valid (non-null) representation provides a structure descriptor.
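The naming scheme lends itself to a short decoding sketch. The helper below is hypothetical (it is not part of any published tool) and only follows the rules stated above: four letters for the transformation, atomic property, metric operator and sign, followed by four digits giving |λ| in thousandths. It does not validate the atomic-property and metric-operator letters, since the full lists of alternatives are not reproduced here.

```python
import math

# Transformations selected by the first letter (L1) of a descriptor name.
TRANSFORMS = {
    "I": lambda x: x,                 # identity, f(x) = x
    "R": lambda x: 1.0 / x,           # reciprocal, f(x) = 1/x
    "L": lambda x: math.log(abs(x)),  # f(x) = ln(|x|)
}


def decode_name(name):
    """Split an 8-character descriptor name into its components.

    Returns (transform letter, atomic property letter,
             metric operator letter, lambda).
    """
    if len(name) != 8 or not name[4:].isdigit():
        raise ValueError("expected four letters followed by four digits")
    transform, atomic, metric, sign = name[:4]
    step = int(name[4:])
    if transform not in TRANSFORMS or sign not in ("N", "P"):
        raise ValueError("unknown transformation or sign letter")
    if step > 1000 or (sign == "N" and step == 0):
        raise ValueError("evaluation point outside the allowed range")
    lam = step / 1000.0
    return transform, atomic, metric, (-lam if sign == "N" else lam)


# The 2001 equally spaced evaluation points in [-1, 1].
grid = [k / 1000.0 for k in range(-1000, 1001)]

print(decode_name("RGGN0477"))
print(len(grid) * 8 * 6 * 3)
```

Decoding RGGN0477 gives the reciprocal transformation applied to the GG polynomial at λ = −0.477, in line with the identity RGGN0477 = 1/GG(−0.477) used in the results.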
Two alternatives were considered: to obtain a relationship expressing the solubility from calculated properties (the ones calculated with the Spartan'14 software) and to obtain a relationship expressing the solubility from the evaluations of the extended characteristic polynomial.
Linear regressions were used, and the search was conducted with the following alternative models [Eq. (2) with one (x) descriptor, Eqs. (3) to (5) with two (x_1 and x_2) descriptors]:
$$\hat{y} = a_0 + a_1 x \tag{2}$$
$$\hat{y} = a_0 + a_1 x_1 + a_2 x_2 \tag{3}$$
$$\hat{y} = a_0 + a_1 x_1 x_2 \tag{4}$$
$$\hat{y} = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 \tag{5}$$
Equation (2) is simple linear regression, Eq. (3) is multiple (bi-varied) linear regression with only additive effects among descriptors, and Eq. (4) is multiple (bi-varied) linear regression with only multiplicative effects among descriptors, while Eq. (5) is multiple (bi-varied) linear regression accounting for both additive and multiplicative effects among descriptors.
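The four kinds of model described above (one descriptor; two descriptors with additive, multiplicative, or both kinds of effects) can be fitted by ordinary least squares. The sketch below is a minimal pure-Python illustration under stated assumptions: the function names and the synthetic data are hypothetical, and the normal-equations solver stands in for whatever regression software was actually used.

```python
def solve(a, b):
    """Solve a linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: descriptors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(columns, y):
    """Least squares with intercept; returns (coefficients, r squared)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve(xtx, xty)
    pred = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1.0 - ss_res / ss_tot


def candidate_models(x1, x2, y):
    """Fit the simple, additive, multiplicative and combined models."""
    prod = [a * b for a, b in zip(x1, x2)]
    return {
        "simple": fit([x1], y),
        "additive": fit([x1, x2], y),
        "multiplicative": fit([prod], y),
        "both": fit([x1, x2, prod], y),
    }


# Synthetic data for eight compounds (hypothetical, not from Table 5).
x1 = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.0, 2.4]
x2 = [1.0, 0.2, 0.8, 0.3, 0.9, 0.1, 0.6, 0.4]
y = [1.0 + 2.0 * a - 3.0 * b + 0.5 * a * b for a, b in zip(x1, x2)]
models = candidate_models(x1, x2, y)
for name, (coef, r2) in models.items():
    print(name, [round(c, 3) for c in coef], round(r2, 4))
```

In a descriptor search, a fit of this kind would be repeated over every descriptor (or pair of descriptors) in the pool, keeping the candidates with the highest adjusted determination coefficient.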
Adjusted determination coefficients allow selection of the best explanatory models, and their Fisher Z transformations (Fisher 1915, 1921) allow comparison among them.
The condition for using Eqs. (2)-(5) is that the dependent variable follows a normal distribution. Otherwise, a series of transformations of the data is required (Bolboacă and Jäntschi 2013).
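A minimal sketch of the Fisher Z transformation and of a Z statistic for comparing two correlation coefficients follows. Several points are assumptions rather than the procedure of the study: the function names are hypothetical, the correlation is taken as the square root of the determination coefficient, and the comparison formula treats the two samples as independent, which is a simplification when both models are fitted on the same eight compounds.

```python
import math


def fisher_z(r):
    """Fisher Z transformation of a correlation coefficient, |r| < 1."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)  # equals 0.5 * ln((1 + r) / (1 - r))


def z_difference(r1, n1, r2, n2):
    """Z statistic for the difference between two correlations.

    Assumes independent samples of sizes n1 and n2 (both > 3).
    """
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se


# Illustrative inputs: a strong and a weak association on 8 compounds.
r_strong = math.sqrt(0.9996)
r_weak = 0.32
print(z_difference(r_strong, 8, r_weak, 8))
```

The transformation stretches values of r near 1, so that differences between very good models become comparable on an approximately normal scale.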

Results and discussion
On the 24 files containing different geometries of monosaccharides, the series of calculated properties are listed in Table 3 (Spartan'14 software, energy calculation with HF 6-31G* dual basis), and their associated thermodynamic quantities are listed in Table 4 (Spartan'14 software, computing IR parameters).
Unfortunately, the solubility values listed in Table 2 depart somewhat from normality, as a histogram of the values shows (see Fig. 1).
The logarithms of the solubilities are much closer to a normal distribution (see Fig. 2).
With so few data (only 8 values), testing for normality is difficult. The alternatives are Anderson-Darling (AD) and Kolmogorov-Smirnov (KS) (Bolboacă and Jäntschi 2009). The KS test provides a value of 0.24358 (p = 0.644) when solubility is tested for normality and a value of 0.14593 (p = 0.985) when ln(solubility) is tested for normality. The AD test provides a value of 0.56634 (p = 0.656) when solubility is tested for normality and a value of 0.2579 (p = 0.949) when ln(solubility) is tested for normality. The p values show that ln(solubility) is much closer to normality than the solubility. Therefore, the logarithm transformation was applied to the solubility.
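The Kolmogorov-Smirnov comparison can be sketched in a few lines. The code below computes only the KS statistic against a normal distribution whose mean and standard deviation are estimated from the sample; it does not reproduce the p values quoted above, and the eight input values are hypothetical (a geometric series chosen to be right-skewed), not the solubilities of Table 2.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def ks_statistic(values):
    """KS statistic versus a normal with sample mean and sample sd."""
    xs = sorted(values)
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf(x, mean, sd)
        # Largest gap between empirical and theoretical CDF at each step.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical right-skewed sample of eight positive values.
sample = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
d_raw = ks_statistic(sample)
d_log = ks_statistic([math.log(v) for v in sample])
print(f"KS raw = {d_raw:.3f}, KS log = {d_log:.3f}")
```

A smaller statistic means a smaller maximum gap between the empirical and the fitted normal distribution, which is the sense in which a log transformation can bring skewed data closer to normality.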
Searching for regressions of the type of Eqs. (2)-(5), with ln(solubility) (as well as with solubility) as the dependent variable and the whole pool of properties listed in Tables 3 and 4 as independent variables with potential explanatory power, was unsuccessful. Only a very poor association between ln(solubility) and SolvE (solvation energy) was identified (r = 0.32, r² = 0.10, r²_adj = 1 − (1 − r²)(8 − 1)/(8 − 2) = −0.05), definitely insufficient to be taken into account; only the intercept of this model is statistically significantly different from zero. A possible explanation is that the solubility (and its logarithm) is almost orthogonal to the other properties (the average correlation of ln(solubility) with the properties listed in Tables 3 and 4 is 0.05).
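The adjusted determination coefficient quoted above follows directly from the formula in the text. The sketch below reproduces the arithmetic for n = 8 observations and one descriptor; the function name and the parameter p (number of descriptors, excluding the intercept) are illustrative choices.

```python
def adjusted_r2(r2, n, p):
    """Adjusted r squared for n observations and p descriptors."""
    if n - p - 1 <= 0:
        raise ValueError("not enough observations for this model")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Values from the text: r2 = 0.10 with 8 compounds and one descriptor.
print(round(adjusted_r2(0.10, 8, 1), 2))
```

A negative adjusted value signals that the descriptor explains less variance than would be expected by chance for a sample of this size.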
Therefore, seeking relationships that estimate solubility from the structure information provided by the extended characteristic polynomial is appropriate. By applying Eqs. (2)-(5), the best explanatory structure-based descriptors were selected as the ones providing the best association under each model; their names and values for the compounds with known solubility are given in Table 5.
As can be seen in Table 5, all selected descriptors came from the geometry-based approach ('G' letter in the third position), in which the adjacency matrix ([Ad]) is replaced with the distance matrix ([Di]) in the former definition of the characteristic polynomial (|λ•Id(A_P, Mol) − Di(M_O, Mol)|). This may seem surprising, but it is not, since the distance matrix carries more information than the adjacency matrix (recall that the 1's are in the same positions in the distance matrix as in the adjacency matrix; the change is in the 0's, some of which are replaced with nonzero values based on the distances between the atoms). The selection of geometry as the predominant metric also explains why the properties derived from energetic calculations given in Tables 3 and 4 fail to estimate the solubility: many geometrical arrangements share the same energy. Regarding the atomic property, two alternatives were not selected among the best explanatory descriptors, the number of hydrogen atoms ('H') and the relative atomic mass ('A'), which suggests that they have little influence on the solubility. This may seem not quite logical when one thinks about the dissociation process accompanying dissolution, but if each hydrogen is involved differently in this process, then it is reasonable that their number does not play an important role while the partial charges ('C') do.
The melting point temperature of the elements ('G') provides the first level of approximation for the association between the chemical structure and solubility (the evaluated polynomial RGGN0477 = 1/GG(−0.477), selected by the Eq. 2a model in Table 5).
The split-effects models [Eq. (3), additive, and Eq. (4), multiplicative] both select the same two atomic properties as giving the best association, the topology of the molecule ('B') and the first ionization potential ('F'), whereas when both additive and multiplicative effects are taken into consideration [Eq. (5)], the pair of explanatory atomic properties changes to electronegativity ('E') and partial charges ('C'). Because the same pair of atomic properties was selected by both the additive and the multiplicative effects models, it seems reasonable to account for both effects, and therefore, the model selected by Eq. (5) should be used for predictions [see the models listed in Table 6, where UV(λ) is the value of the extended characteristic polynomial UV at λ, U encodes the atomic property (A_P) and V encodes the distance metric (M_O)].
In Table 6, the 'Equation' column indicates the type of the equation, which is one of the alternatives given as Eqs. (2)-(5). The 'Equation for ln(Solubility)' column gives the selected polynomials (GG selected by the equation of type 2a, BG and FG selected by Eq. 3a, and so on), the evaluation points (λ) that provide the best explanatory power for the model (λ = −0.477 for the GG polynomial in Eq. 2a, λ = 0.987 for the BG and λ = 0.285 for the FG polynomial for the selections made using Eq. 3a, and so on), the meaning of the encodings in the names of the polynomials (the GG polynomial was obtained using the 'G' alternative for the atomic property, melting point temperature relative to the diamond allotrope of carbon, and the 'G' alternative for the metric operator, calculations made using the Cartesian coordinates from the geometric 3D model of the molecule), and the selected alternative for the connectivity ([C] = [Di], the distance matrix, in the case of the model obtained using the equation of type 2a). The same column also gives the coefficients of the equations with their 95% confidence intervals (that is, at a 5% risk of being in error); for example, −2.35 ± 0.24 for Eq. (2a), where −2.35 is the intercept of the model of type 2a.
Table 3 Calculated molecular properties. CID: compound identifier from PubChem; Conf: number of conformers (= 9^(n−1)); LUMO: lowest unoccupied molecular orbital energy (in eV); HOMO: highest occupied molecular orbital energy (in eV); DM: dipole moment (in Debye); Energy: total energy (in a.u.; a.u. = Hartrees); Energy aq.: total energy in solvated form at infinite dilution (in a.u.; a.u. = Hartrees); Solv_E: solvation energy (from the SM5.4 Cramer-Truhlar method, see (Chambers et al. 1996), in kJ/mol). Please note that the solvation energy gives only an estimate, and thus, a possibly better alternative is to account for the number of surrounding water molecules.
Table 4 note: the thermodynamic quantities from Table 4 are given for IUPAC SATP (T_0 = 298.15 K): H^0 (in a.u.; a.u. = Hartrees), G^0 (in a.u.; a.u. = Hartrees), S^0 (in J•mol^−1•K^−1), C_V^0 (in J•mol^−1•K^−1), and at 0 K (S^0K, C_V^0K and ZPE).
The equation listed as entry 5a in Table 6 was used for estimation (for the compounds with known solubility) and for prediction (for the rest of the compounds). The results are given in Table 7. As can be seen in Table 7, the estimated values are close to the measured values of the solubility. The greatest departure is at CID 90008 (corresponding to d-psicose), and it is below 2%, which is safely within the range of the experimental error of the measurements on relative scales (5%). In Table 7, the ln(Solubility) column contains the values of ln(Solubility) as estimated (for the first 8 monosaccharides) and, respectively, predicted (for the rest of them) by the equation listed as entry 5a in Table 6, while the Solubility column contains the exponentials of those values (exp(ln(solubility)) = solubility), ready to be used.
The estimated values show a solubility of d-idose (CID 111123) and d-ribose (CID 5311110) similar to the solubility of d-psicose (CID 90008), while d-dihydroxyacetone (CID 670) seems to have the highest solubility (about 0.41 mol/mol, see Table 7). According to the literature, at 20 °C (not at 25 °C), the solubility of d-dihydroxyacetone is greater than 930 g/l (SCCS 2010; more than 10 mol/l), which supports this estimate of its solubility at 25 °C. d-Altrose seems to have the second highest solubility among the selected monosaccharides (about 0.35 mol/mol, see Table 7), but unfortunately no experimental data are available for comparison.
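The conversion behind the 'more than 10 mol/l' figure is a one-line calculation, sketched here for transparency. The atomic masses are standard rounded values, the molecular formula C3H6O3 for dihydroxyacetone follows from the general formula (CH2O)_n with n = 3, and the helper name is hypothetical.

```python
# Standard atomic masses in g/mol (rounded).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}


def molar_mass(counts):
    """Molar mass in g/mol from a mapping of element to atom count."""
    return sum(ATOMIC_MASS[element] * k for element, k in counts.items())


m_dha = molar_mass({"C": 3, "H": 6, "O": 3})  # dihydroxyacetone, C3H6O3
conc = 930.0 / m_dha                          # mol/l for 930 g/l
print(f"M = {m_dha:.3f} g/mol, c = {conc:.2f} mol/l")
```

Note that this is a concentration per litre of solution, so it is not directly comparable with the mol/mol values of Table 7 without the solution density.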
Please note that, in the absence of experimental data, the predicted solubilities of the monosaccharides (the last 16 entries in Table 7) are not validated. Their validation requires further measurements of the solubilities of monosaccharides.

Conclusions
Experimental measurements of the solubilities of monosaccharides are scarce and rarely reported at standard conditions. Since solubility is of great importance for the biological role of the monosaccharides and the search for property-property relationships expressing the solubility was unsuccessful, a search for a structure-property relationship was conducted. By using the extended characteristic polynomial, a structure-property relationship with a great capacity to estimate the solubility of the monosaccharides was found (r²_adj = 0.9996). The relation suggests that the solubility of the monosaccharides is strongly dependent on the geometry, with the atomic partial charges and the electronegativities of the elements playing the main role in its expression. The relation was used to predict the solubility of 16 monosaccharides, and plausible solubilities were obtained.