Improved model for fullerene C60 solubility in organic solvents based on quantum-chemical and topological descriptors
Fullerenes are sparingly soluble in many solvents. The dependence of fullerene solubility on the molecular structure of the solvent must be understood in order to handle this class of compounds efficiently. To find this dependence, ab initio quantum-chemical calculations were combined with a quantitative structure–property relationship (QSPR) approach to model the solubility of fullerene C60 in 122 organic solvents. A genetic algorithm and multiple linear regression analysis (GA-MLRA) were applied to generate correlation models. The best performance is achieved by the four-variable MLRA model, with a test-set prediction coefficient r² = 0.903. This study reveals a correlation of the highest occupied molecular orbital (HOMO) energy, certain heteroatom fragments, and geometrical parameters with solubility. Several other important solvent parameters that affect the C60 solubility have also been evaluated by the QSPR analysis. The GA-MLRA approach, enhanced by quantum-chemical calculations, yields reliable results and allows one to build simple, interpretable models that can be used to predict C60 solubility in various organic solvents.
Keywords: C60, Fullerene, Solubility, QSPR, DFT, Quantum-chemical descriptors, Modeling and simulation, Predictive method
It is well known that the fullerenes are sparingly soluble in various solvents (Ruoff et al. 1993). This phenomenon has crucial consequences for both the basic studies on fullerenes and their industrial applications. The dependence of fullerene’s solubility on both molecular structure of the solvent and temperature must be understood in order to efficiently separate different members of the fullerene family from each other and from their precursors or derivatives (Korobov and Smith 2000). Though experimental data on fullerene solubility is available, there is still no reliable theory to explain (or predict) absolute values of fullerene solubility and variations in solubility when changing the solvent or using different fullerenes. This serious limitation is one of the reasons why separation of fullerenes on a large scale is still rather difficult without the aid of chromatography. Moreover, the data on solubility of the fullerene C60 in organic solvents could help to develop its various applications in chemistry and technology.
At present, the only approach that allows a predictive model of fullerene C60 solubility in different organic solvents to be constructed is the quantitative structure–property relationship (QSPR) approach, based on molecular descriptors (i.e., parameters) calculated from the molecular graphs of the solvents (Estrada and Gutman 1996; Simon 1987; Balaban 1983). The first attempt to explain the trends in C60 solubility was made by Sivaraman et al. (1992), who studied 15 solvents and found a correlation between the experimental fullerene solubility and the calculated solubility parameter of the solvent. The same group later plotted the fullerene solubility against the Hildebrand solubility parameter for a series of solvents (Sivaraman et al. 1994). Ruoff et al. (1993) studied the C60 solubility in 47 solvents and examined its dependence on the polarizability, polarity, molecular size, and cohesive energy density (energy difference between solid phase and free atoms system) of the solvents. However, they found no single parameter that universally explains the solubility of C60, and no predictive model was built in their study. Smith et al. (1996) applied the theoretical linear solvation energy approach to a solubility dataset of 101 organic solvents. The correlation coefficient was r² = 0.67, and the statistically significant descriptors were molar volume, Lewis acidity, Lewis basicity, and electrostatic basicity. Thus, a good solvent for C60 should have a large molar volume and be both a good Lewis acid and a good Lewis base, while the polarity of its negatively charged regions should be minimal. Although the idea is interesting, the drawback of that study is the low correlation coefficient of the modeled relationship.
In addition, a study has been published in which the authors applied a molecular dynamics approach to model the solubility of fullerene in two solvents, water and carbon disulfide (Vanin et al. 2008). However, this approach is too time-consuming to apply to a large dataset of solvents and is therefore not well suited to QSPR. Other articles have also addressed the solubility of C60 in different solvents (Marcus 1997; Marcus et al. 2001; Abraham et al. 2000; Hansen and Smith 2004; Kiss et al. 2000). A recent study in this field was published by Liu et al. (2005), who used the largest dataset to date, 128 solvents. They built a model using the least-squares support vector machine (LSSVM) method and obtained correlation coefficients of r² = 0.761 for the whole set and r² = 0.861 for 122 compounds (with six outliers removed). When the dataset was split into a training set (92 compounds) and a test set (30 compounds), the correlation coefficients were 0.910 and 0.908, respectively. The authors used the CODESSA software (Katritzky et al. 1994, 1995) to generate a set of descriptors and a semiempirical quantum-mechanical approach to calculate quantum-chemical parameters for the considered solvents. They showed that the correlation model involves the following descriptors: Randic index (order 3), relative molecular weight, HOMO-1 energy (the orbital just below the highest occupied molecular orbital), relative negative charge, average bonding information content, and average one-electron reactivity index for a C atom. In our recent study of C60 solubility (Toropov et al. 2007a), a QSPR model based on SMILES (Simplified Molecular Input Line Entry System) notations and optimal descriptors was developed. That model achieves a correlation coefficient of r² = 0.861 for the training set and r² = 0.890 for the external test set.
The approach implemented in the current study differs considerably from the previous ones, because the earlier investigations used only a very simple linear model with one complex parameter (a correlation weight based on SMILES notation), which nevertheless provided good predictive ability. In addition, the authors of this study, in collaboration with an experimental biology group, recently published a study that aimed to choose a better solvent for C60 fullerene in biological systems on the basis of cytotoxicity tests (Cook et al. 2010).
As pointed out above, there are many different approaches to calculating and predicting C60 solubility in organic solvents. Some of them are fully mechanistic (Marcus 1997; Marcus et al. 2001; Abraham et al. 2000; Hansen and Smith 2004; Stukalin et al. 2003) and were developed from a thermodynamic point of view; others are statistically based and have good correlation coefficients, but are not transparent and are difficult to interpret (Kiss et al. 2000; Liu et al. 2005; Toropov et al. 2007a, b, 2008, 2009). In this study, we aimed to find a simple, transparent, and computationally fast relationship, if possible mechanistically interpretable, to predict the solubility of C60 in various solvents. In addition, the study aims to estimate the predictive potential of topological descriptors and of quantum-chemical parameters obtained by high-level ab initio calculations in QSPR modeling of the fullerene C60 solubility in organic solvents.
The experimental data on the solubility of C60 in 122 different solvents were taken from Liu et al. (2005); they were originally published in a review paper (Beck and Mandi 1997). The solvent data are listed in Table S1 (see Supplementary material) and in Table 2. The solubilities are given not in weight units (e.g., mg/mL) but as logarithms of the mole fraction (log S), because the log S values correspond to the free energy changes in the solvation process. The data were split into a training set of 92 solvents and a test set of 30 solvents, following the random split used in Liu et al. (2005) and Toropov et al. (2007c).
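To make the unit convention concrete, the sketch below converts a solubility in mg/mL into log S on the mole-fraction scale. It is only an illustration: the function name and the dilute-solution approximation (1 mL of saturated solution contains about 1 mL of pure solvent) are assumptions, and the example inputs are illustrative toluene-like values rather than entries from Table S1.

```python
import math

C60_MOLAR_MASS = 720.66  # g/mol, i.e. 60 carbon atoms x 12.011 g/mol


def log_mole_fraction(solubility_mg_per_ml, solvent_density_g_per_ml,
                      solvent_molar_mass_g_per_mol):
    """Return log10 of the C60 mole fraction in a saturated solution.

    Assumption: the solution is dilute, so 1 mL of solution holds
    roughly 1 mL of pure solvent.
    """
    # Moles of C60 and of solvent contained in 1 mL of solution
    n_c60 = solubility_mg_per_ml / 1000.0 / C60_MOLAR_MASS
    n_solvent = solvent_density_g_per_ml / solvent_molar_mass_g_per_mol
    return math.log10(n_c60 / (n_c60 + n_solvent))


# Illustrative inputs (assumed, not taken from the paper's dataset):
# 2.8 mg/mL in a solvent of density 0.867 g/mL and molar mass 92.14 g/mol
example_log_s = log_mole_fraction(2.8, 0.867, 92.14)
```

With these inputs the result is roughly -3.4, which shows why log S values for C60 are negative numbers of modest magnitude even for comparatively good solvents.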
Quantum-chemical calculations and statistical GA-MLRA approach
To find the parameters responsible for C60 solubility in organic solvents, we used two approaches: (1) calculation of structure-based additive descriptors, and (2) calculation of quantum-chemical descriptors. The first approach yields a set of constitutional, topological, and molecular descriptors calculated with the DRAGON software (Todeschini and Consonni 2003). A set of 1,060 molecular descriptors of different types was used to describe the chemical diversity of the considered solvents. The descriptor types are: (a) constitutional (atom and group fragments), (b) functional groups, (c) atom-centered fragments, (d) empirical, (e) topological, (f) walk counts, (g) various autocorrelations from the molecular graph, (h) Randic molecular profiles from the geometry matrix, (i) geometrical, (j) WHIM, and (k) GETAWAY descriptors, as well as various indicator descriptors. The meaning of these molecular descriptors and the calculation procedures are summarized elsewhere (Todeschini and Consonni 2000).
Considering the importance of electronic molecular properties for QSAR/QSPR, additional quantum-chemical descriptors were calculated by the second approach (Hehre et al. 1986; Becke 1993). These descriptors were calculated with the Gaussian 03 software (Frisch et al. 2004) using Density Functional Theory (DFT). The global minimum-energy conformation was identified for each molecule. A combination of Becke’s three-parameter adiabatic connection exchange functional with the Los Alamos National Lab effective core potential and a double-zeta basis set for valence electrons (B3LYP/LANL2DZ) was employed in order to obtain reliable energetics and accurate data on the electronic properties of the considered molecules. The LANL2DZ basis set was chosen because some of the considered solvents contain Br and I atoms; it was developed for calculations involving heavy atoms beyond the third row (Foresman and Frisch 2000). Overall, several kinds of quantum-chemical descriptors were derived from these calculations: dipole moments (total dipole moment and its X, Y, and Z components); orbital energies, EHOMO and ELUMO; HLgap (the gap between EHOMO and ELUMO); and, finally, heats of formation and ionization potentials.
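As a small illustration of how two of these descriptors follow from raw calculation output, the sketch below derives HLgap and the total dipole moment from orbital energies and dipole components. The class and field names are hypothetical, the energies are assumed to be in hartree and the dipole components in debye, and no parsing of Gaussian output is attempted.

```python
import math
from dataclasses import dataclass

HARTREE_TO_EV = 27.211386  # conversion factor from hartree to electronvolt


@dataclass
class QuantumDescriptors:
    """Hypothetical container for per-solvent quantum-chemical values."""
    e_homo: float            # HOMO energy, hartree
    e_lumo: float            # LUMO energy, hartree
    dipole_xyz: tuple        # (X, Y, Z) dipole components, debye

    @property
    def hl_gap(self):
        # Gap between the LUMO and HOMO energies, in hartree
        return self.e_lumo - self.e_homo

    @property
    def hl_gap_ev(self):
        # Same gap expressed in electronvolts
        return self.hl_gap * HARTREE_TO_EV

    @property
    def dipole_total(self):
        # Total dipole moment is the Euclidean norm of the components
        return math.sqrt(sum(c * c for c in self.dipole_xyz))
```

Keeping the derived quantities as properties means the descriptor table can never hold a gap that is inconsistent with the stored orbital energies.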
The correlation coefficients for all pairs of descriptor variables used in the models were evaluated in order to identify highly correlated descriptors and to avoid redundancy in the data set. Any redundancy might lead to one chemical property being overused in the explanation of the dependent variable. Hence, constant descriptors and highly correlated descriptors (pairwise r² > 0.9) were removed from further consideration. Furthermore, within each final model, descriptors with a cross-correlation coefficient larger than 0.6 were avoided.
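The pre-filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say which member of a correlated pair was discarded, so the sketch simply keeps the first descriptor encountered, and the function names are hypothetical.

```python
def pearson_r2(u, v):
    """Squared Pearson correlation between two non-constant columns."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv * s_uv / (s_uu * s_vv)


def filter_descriptors(columns, threshold=0.9):
    """Return the names of descriptors that survive the filter.

    columns maps a descriptor name to its list of values over all solvents.
    Constant descriptors are dropped; a descriptor is also dropped when its
    r2 with an already kept descriptor exceeds the threshold (assumed
    keep-first policy).
    """
    kept = []
    for name, values in columns.items():
        if max(values) == min(values):
            continue  # constant descriptor carries no information
        if all(pearson_r2(values, columns[k]) <= threshold for k in kept):
            kept.append(name)
    return kept
```

The same `pearson_r2` helper could be reused with a threshold of 0.6 to screen the descriptors inside a candidate model, if that limit is read as applying to r² as well.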
The correlation between solubility and structural properties was obtained by combining variable selection with a Genetic Algorithm (GA) and MLRA. GA has been applied in recent years as a powerful tool to address many problems in drug design (Davis 1991; Devillers 1996). We applied GA to select, from the set of all calculated descriptors, the combinations most relevant for obtaining models with the highest predictive power for solubility. Overall, the combined GA-MLRA technique was used to select the appropriate descriptors and to generate different QSPR models. The GA started with a population of 30 random models and was run for up to 5500 iterations of evolution, with a mutation rate of 35% and the correlation coefficient (r) as the fitness (scoring) function. The BuildQSAR program (de Oliveira and Gaudio 2001) was used for the GA analysis and the derivation of the QSPR models. The final set of QSPR models was validated by the “leave-one-out” technique, and its predictive ability was evaluated and confirmed by the cross-validation coefficient q².
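The following sketch shows one way a GA-MLRA search and leave-one-out validation could be organized. It is not the BuildQSAR implementation: the steady-state replacement scheme, the crossover rule, and all function names are assumptions made for illustration, while the default population size, iteration count, mutation rate, and fitness function (r) follow the values quoted above. The q² here is taken as 1 − PRESS/SS_tot, which is one common definition.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_mlr(X, y):
    """Least-squares coefficients [intercept, b1, ...] via normal equations."""
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve(ata, aty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def pearson_r(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den if den else 0.0


def fitness(X, y, subset):
    """Correlation coefficient r between observed and MLR-fitted values."""
    cols = [[row[j] for j in subset] for row in X]
    try:
        coef = fit_mlr(cols, y)
    except ZeroDivisionError:  # singular system, e.g. collinear descriptors
        return 0.0
    return pearson_r(y, [predict(coef, row) for row in cols])


def ga_select(X, y, n_vars, pop_size=30, iterations=5500, mutation=0.35, seed=0):
    """Search for the n_vars-descriptor subset with the highest fitness."""
    rng = random.Random(seed)
    n_desc = len(X[0])
    cache = {}

    def score(s):
        if s not in cache:
            cache[s] = fitness(X, y, s)
        return cache[s]

    pop = [tuple(sorted(rng.sample(range(n_desc), n_vars))) for _ in range(pop_size)]
    for _ in range(iterations):
        pop.sort(key=score, reverse=True)
        a, b = rng.sample(pop[:pop_size // 2], 2)       # parents from the fitter half
        child = set(rng.sample(sorted(set(a) | set(b)), n_vars))  # crossover
        if rng.random() < mutation:                     # swap one descriptor
            child.discard(rng.choice(sorted(child)))
            child.add(rng.choice([j for j in range(n_desc) if j not in child]))
        pop[-1] = tuple(sorted(child))                  # replace the worst model
    best = max(pop, key=score)
    return best, score(best)


def loo_q2(X, y):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS_tot."""
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    mean_y = sum(y) / len(y)
    return 1.0 - press / sum((v - mean_y) ** 2 for v in y)
```

In use, `X` would be the list of descriptor rows for the 92 training solvents and `y` their log S values; `ga_select(X, y, n_vars=4)` returns the column indices of a candidate four-variable model, and `loo_q2` is then evaluated on just those columns.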
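Table 1 Statistical characteristics for the selected models with one to five descriptors, for the training set (n = 92) and the test set (n = 30). Descriptor sets recovered from the table: X1sol, HOMO, J3D (three-variable model); X1sol, TI2, FDI, H-052 (four-variable model); GAP, AMW, X3, FDI, nHacc (five-variable model).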
The names and CAS numbers of all solvents used, together with the names of the descriptors involved in the best models, are shown in Table S1 of the Supplementary material. The data in Table 1 show that the four-variable model (Eq. 4) has the best predictive potential for the solubility of C60 in organic solvents. While constructing the models, great care was taken to avoid including highly collinear descriptors. All collinear descriptors were eliminated from further consideration.
Table 2 List of the training and test sets, with experimental and calculated (Eq. 4) values of the solubility of fullerene C60 in organic solvents; the residual column is log S(expr) − log S(calc).
As can be seen, all models include mostly transparent and readily explainable descriptors, including quantum-chemical parameters.
The first four models include the topological descriptor X1sol, the solvation connectivity index (chi-1), which encodes the solvation property of the compound (Todeschini and Consonni 2003). This molecular descriptor is defined in order to model solvation entropy and dispersion interactions in solution. It relates the characteristic dimension of the molecule to atomic parameters (quantum number, bond indexes, etc.). The two-dimensional descriptor X1sol was proposed in 1991 by the group of Zefirov and Palyulin (Antipin et al. 1991) in order to treat the enthalpies of non-specific solvation. For instance, as described by Duchowicz et al. (2008), the solvation enthalpies of propane (CH3CH2CH3) and dimethylmercury (CH3HgCH3) differ enormously, but both molecules are represented by the same hydrogen-depleted graph and hence have identical topological indexes, which do not take the atom types into account. The solvation index was created precisely to differentiate such cases, providing the general formula (see below) when calculated for hydrogen- and fluorine-depleted molecular graphs.
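The propane versus dimethylmercury example can be worked through in a few lines. The sketch assumes the first-order solvation connectivity index in the form commonly given in the descriptor literature, X1sol = (1/4) Σ over bonds of L_i·L_j / sqrt(δ_i·δ_j), where L is the principal quantum number of an atom and δ its vertex degree in the hydrogen-depleted graph. The function name and the small quantum-number table are illustrative, and DRAGON's own implementation may differ in detail.

```python
import math

# Principal quantum number of the valence shell (the period of the element);
# only a few elements are listed, for illustration
PRINCIPAL_QN = {"C": 2, "N": 2, "O": 2, "S": 3, "Cl": 3, "Br": 4, "I": 5, "Hg": 6}


def x1sol(atoms, bonds):
    """First-order solvation connectivity index of a hydrogen-depleted graph.

    atoms: list of element symbols, one per heavy atom
    bonds: list of (i, j) index pairs, one per bond between heavy atoms
    """
    # Vertex degree = number of heavy-atom neighbours
    degree = [0] * len(atoms)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    total = sum(
        PRINCIPAL_QN[atoms[i]] * PRINCIPAL_QN[atoms[j]]
        / math.sqrt(degree[i] * degree[j])
        for i, j in bonds
    )
    return total / 4.0


# Same three-vertex path graph, different central atom
propane = x1sol(["C", "C", "C"], [(0, 1), (1, 2)])
dimethylmercury = x1sol(["C", "Hg", "C"], [(0, 1), (1, 2)])
```

Under this form the two molecules, which share one graph, give clearly different values (about 1.41 for propane and about 4.24 for dimethylmercury), and for molecules built only from second-row atoms the index reduces to the ordinary first-order connectivity index.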
The two- and three-variable models (Eqs. 2, 3) include the J3D descriptor, which encodes heteroatom content and bond multiplicity. This descriptor greatly improves the performance, from r² = 0.6 for the one-variable model to almost 0.8 for the two-variable one. The J3D descriptor is not cross-correlated with the X1sol descriptor. It encodes different features, so it supplements the information carried by X1sol and improves the model performance. The J3D descriptor indicates the importance of the heteroatom content of the solvent structure, because the presence of heteroatoms dramatically changes the solubility properties of the compound. Although both X1sol and J3D partially encode the presence of heteroatoms in the molecule, they act independently and complement each other.
In search of the model that best predicts the test set values, we also generated four- and five-variable models. We found that the four-variable model has the best performance: its predictive r² for the test set is 0.903, with all compounds included (Table 1, Fig. 1). Model 5 (Eq. 5) performs much worse on the test set, most probably because of overfitting.
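Because the comparison between models rests on a test-set r², it may help to see how such a value is computed. The text does not state which definition was used, so the sketch below shows two common ones: the squared Pearson correlation between experimental and predicted log S, and the coefficient of determination 1 − SS_res/SS_tot. The function names are hypothetical.

```python
def r2_pearson(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    mo = sum(observed) / len(observed)
    mp = sum(predicted) / len(predicted)
    s_op = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    s_oo = sum((o - mo) ** 2 for o in observed)
    s_pp = sum((p - mp) ** 2 for p in predicted)
    return s_op * s_op / (s_oo * s_pp)


def r2_determination(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mo = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mo) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

The two values coincide for a least-squares fit evaluated on its own training data but can differ on an external test set, where the second definition also penalizes systematic bias in the predictions. An overfitted model typically shows a high training r² together with a markedly lower value from either function on the test set.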
A rather interesting descriptor involved in Eq. 4 is H-052, the number of hydrogens at the considered molecular fragment. One concludes that a large number of substituents at the carbon atoms adjacent to the carbon atom bearing a heteroatom results in lower solubility. However, this statement holds only for solvents that are not fully substituted at the neighboring carbon atoms, i.e., where 1–4 hydrogen atoms remain. Interestingly, when the neighboring carbon atoms are unsubstituted (H-052 = 6), the solubility is also lower. Furthermore, the solvents with fully substituted carbon atoms (no hydrogens at the neighbors of the carbon atom bearing a heteroatom) are the best solvents for C60, and their behavior differs from that of the other, hydrogen-containing solvents.
As shown above, all descriptors included in the models are quite simple and transparent. They can easily be calculated with available software, and the developed models can then be used to calculate a predicted value of the C60 solubility in a particular solvent. This is one of the advantages of the models obtained in this study over the investigation of Liu et al. (2005), in which the models are not reported explicitly and therefore cannot be reproduced to predict solubility in new solvents.
Overall, among all models generated the four-variable model is a good candidate for further use for C60 solubility predictions. It has a good predictive power, transparent descriptors and includes only easy-to-calculate descriptors. The contribution and structural influence of each descriptor involved in four-variable model is discussed above.
In this study we have compared the performance of one-, two-, three-, four-, and five-variable models constructed by the GA-MLRA approach for prediction of the solubility of C60. The four-variable model provides the best predictive ability (test-set r² = 0.903), while the other models give only satisfactory predictions. For the one-, two-, and three-variable models this can be explained by the insufficient information provided by the small number of descriptors, and for the five-variable model by overfitting, which leads to poor prediction.
The study reveals a correlation of HOMO energy (HOMO–LUMO gap), heteroatom fragments, and geometrical parameters with solubility. Although the X1sol descriptor alone showed only a moderate correlation, it still makes the main contribution to solubility. The presence of a quantum-chemical descriptor, the HOMO energy, confirms the importance of the nucleophilic properties of solvents for the solubility of C60: a higher HOMO energy of the solvent results in higher solubility of C60. In addition, different distributions of HOMO–LUMO gap energies were observed for different types of solvents. Another descriptor, which encodes the number of hydrogens at certain fragments (H-052), indicates that the solvents with fully substituted carbon atoms (no hydrogens at the neighbors of the carbon atom bearing a heteroatom) are the best solvents for C60, and that their behavior differs from that of the other, hydrogen-containing solvents. Surprisingly, the folding descriptor FDI showed a certain correlation with solubility, indicating that the closer the solvent structure is to linearity, the higher the solubility of C60 in that solvent. Several other important solvent parameters that affect C60 solubility have also been discussed on the basis of the QSPR analysis. They lead to a better understanding of the solvent structure required to improve fullerene solubility.
This study demonstrates that applying the GA-MLRA technique in combination with quantum-chemical and topological descriptors yields reliable models. They are simple, interpretable, transparent, and comparable to previously published results. The best performance is achieved by the four-variable MLRA model, with a test-set prediction coefficient r² = 0.903. The approach based on topological and quantum-chemical data provides a model for fullerene C60 solubility that is comparable with the model from the previous study (Liu et al. 2005) and with the model suggested in a recently published paper (Toropov et al. 2007c), with the additional advantage of being transparent and mechanistically interpretable. These conclusions allow us to believe that the constructed models can be used for future predictions of C60 solubility in various organic solvents (for industry and laboratory experiments) and that they provide a basis for understanding this phenomenon.
The authors thank the National Science Foundation for support through the “Interdisciplinary Center for Nanotoxicity” (NSF HRD #0833178) and NSF EPSCoR Grant no. 362492-190200-01\NSFEPS-0903787, and the Department of Defense, through the U. S. Army Engineer Research and Development Center (Vicksburg, MS), for the grant “Development of Predictive Techniques for Modeling Properties of NanoMaterials Using New OSPR/QSAR Approach Based on Optimal NanoDescriptors” (Contract #W912HZ-06-C-0061). The authors are grateful to the Mississippi Center for Supercomputing Research (MCSR) for providing state-of-the-art high-performance computing facilities and excellent services in support of this research.
- Abraham MH, Green CE, Acree WE (2000) Correlation and prediction of the solubility of Buckminsterfullerene in organic solvents; estimation of some physicochemical properties. J Chem Soc Perkin Trans 2:281–286
- Antipin IS, Arslanov NA, Palyulin VA, Konovalov AI, Zefirov NS (1991) Solvation topological index. Topological description of dispersion interaction (in Russian). Dokl Akad Nauk SSSR 316:925–927 (Chem Abstr 115, 91390)
- Beck MT, Mandi G (1997) Solubility of C60. Fuller Sci Technol 5:291–310
- Davis L (1991) Handbook of genetic algorithms. Van Nostrand Reinhold, New York, USA
- Devillers J (1996) Genetic algorithms in molecular modeling. Academic Press Ltd, London
- Estrada E, Gutman I (1996) A topological index based on distances of edges of molecular graphs. J Chem Inf Comp Sci 36:850–853
- Foresman JB, Frisch A (2000) Exploring chemistry with electronic structure methods. Gaussian, Inc., Pittsburgh
- Frisch MJ, Trucks GW, Schlegel HB, Scuseria GE, Robb MA, Cheeseman JR et al (2004) Gaussian 03, Revision C.02. Gaussian, Inc., Wallingford
- Hehre WJ, Radom L, Schleyer P, Pople JA (1986) Ab initio molecular orbital theory. Wiley, New York
- Katritzky AR, Lobanov VS, Karelson M (1994) Comprehensive descriptors for structural and statistical analysis. Version 2.0. Semichem, Inc., Gainesville (reference manual)
- Korobov MV, Smith AL (2000) Solubility of the fullerenes. In: Kadish KM, Ruoff RS (eds) Fullerenes: chemistry, physics, and technology. John Wiley and Sons Inc, New York, pp 53–90
- Marcus Y (1997) Solubilities of buckminsterfullerene and sulfur hexafluoride in various solvents. J Phys Chem 101:8617–8623
- Mohar B (1989) Laplacian matrices of graphs. In: Graovac A (ed) MATH/CHEM/COMP 1988. Studies in physical and theoretical chemistry, vol 63, Elsevier, Amsterdam, pp 1–8
- Randic M, Kleiner AF, De Alba LM (1994) Distance/distance matrices. J Chem Inf Comp Sci 34:277–286
- Sivaraman N, Dhamodaran R, Kaliappan I, Srinivasan TG, Rao PRV, Mathews CK (1994) Solubility of C60 and C70 in organic solvents. In: Kadish KM, Ruoff RS (eds) Recent advances in the chemistry and physics of fullerenes and related materials. The Electrochemical Society, Pennington, pp 156–165
- Smith AL, Wilson LY, Famini GR (1996) A quantitative structure-property relationship study of C60 solubility. In: Recent advances in the chemistry and physics of fullerenes and related materials, vol 3. Proceedings of Electrochemical Society, Philadelphia, pp 53–62
- Todeschini R, Consonni V (2003) DRAGON software for the calculation of molecular descriptors Version 3.0