Introduction

Quantitative structure-property relationship (QSPR) studies use statistical models to estimate various properties of chemical compounds from their molecular structure [1–14]. In QSPR studies, the structure is often represented by structural descriptors. Among the various structural parameters applied in QSPR analysis, topological indices are often used to model physical, chemical, or biological properties [5–14]. The first applications of topological indices in structure-property relationship studies were proposed by Wiener in 1947 [15] and later, in 1975, by Randic [16]. The Kier and Hall molecular connectivity indices are a generalization of the Randic index [10]. The molecular connectivity indices contain considerable information about the structure of the molecule. Kier and Hall [10] described in detail the information encoded by molecular connectivity indices, especially about the topological but also the electronic properties of the molecule.

Gemini surfactants consist of two hydrophobic tails and two hydrophilic heads connected by a spacer group. Because the spacer binds two conventional surfactant molecules together, these compounds have very good properties in aqueous solution. The critical micelle concentration (cmc) values of these surfactants are significantly lower than those of the corresponding monomeric counterparts.

In the previous paper [12], a QSPR study was performed to derive a model relating the critical micelle concentration of gemini surfactants to their structure. The relationship was developed for a set of 21 cationic (bromide) gemini surfactants employing the molecular connectivity indices only. The previous model contains the second-order molecular connectivity index which, as suggested in [12], probably encodes information about flexibility.

In the present study, four models were derived. The relationships were developed for a set of 23 cationic (chloride) gemini surfactants, again employing only the molecular connectivity indices. Just as in the previous study [12], the present models were derived for molecules of various structures, i.e., the effect of all groups of the molecule on the cmc value was taken into account. The structures of the investigated compounds are quite different from those of the previously studied bromides. The test compounds also differ in structure from the previously studied compounds. The present study confirms that the one-descriptor model that best estimates the cmc values is the one containing the second-order molecular connectivity index, but further analysis showed that the model containing the first-order valence molecular connectivity index better describes the changes in cmc values of cationic (chloride) gemini surfactants caused by structure modification.

Materials and methods

Dataset

The data set contains only gemini surfactants with chlorides as counterions. The compounds were chosen to include gemini surfactants with medium and long spacer lengths. The chemical structures of the investigated compounds along with their abbreviations are presented in Fig. 1. The data set contains 23 training-set compounds and 2 test compounds. The chemical structures of the surfactants and the experimental values of cmc were taken from the literature [17–21].

Fig. 1 Structures of the investigated compounds and their abbreviations

Molecular connectivity indices

Just as in the previous papers [11–13], the structure of the molecule is represented by Kier and Hall’s molecular connectivity indices.

The mth order molecular connectivity index is defined [10] by

$$ {}^m\chi_k={\displaystyle \sum_{j=1}^{n_m}{\displaystyle \prod_{i=1}^{m+1}{\left({\delta}_i\right)}_j^{-0.5}}} $$
(1)

where \( {\delta}_i \) is the connectivity degree, i.e., the number of non-hydrogen atoms to which the ith non-hydrogen atom is bonded; m is the order of the connectivity index; k denotes the type of the fragment of the molecule, for example, path (p), cluster (c), or path-cluster (pc); and \( n_m \) is the number of fragments of type k and order m.
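Equation 1 can be illustrated for path-type fragments with a short Python sketch. The function name `path_chi`, the bond-list input format, and the two butane isomers used as input are illustrative assumptions and are not taken from the study; cluster and path-cluster fragments are not covered.

```python
from math import prod


def path_chi(bonds, order):
    """Path-type connectivity index of a given order (Eq. 1, k = path).

    `bonds` lists the bonds of the hydrogen-suppressed molecular graph
    as pairs of atom labels (hypothetical input format).
    """
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # Connectivity degree: number of non-hydrogen neighbours of each atom.
    delta = {atom: len(nb) for atom, nb in neighbours.items()}

    paths = set()

    def extend(path):
        # A path of order m contains m bonds, i.e. m + 1 atoms.
        if len(path) == order + 1:
            # Store each path once, regardless of traversal direction.
            paths.add(min(tuple(path), tuple(reversed(path))))
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for atom in neighbours:
        extend([atom])

    return sum(prod(delta[a] for a in p) ** -0.5 for p in paths)


n_butane = [(0, 1), (1, 2), (2, 3)]    # C-C-C-C
isobutane = [(0, 1), (0, 2), (0, 3)]   # central carbon bonded to three CH3

for name, bonds in (("n-butane", n_butane), ("isobutane", isobutane)):
    print(name, [round(path_chi(bonds, m), 3) for m in (0, 1, 2)])
```

For these two isomers the sketch gives a first-order index of about 1.914 (n-butane) and 1.732 (isobutane), and a second-order index of 1.000 and about 1.732, respectively: branching lowers the first-order index and raises the second-order one.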

For molecules with the heteroatoms, the valence connectivity degree has been defined [10] as

$$ {\delta}_i^{\nu }=\frac{Z_i^{\nu }-{h}_i}{Z_i-{Z}_i^{\nu }-1} $$
(2)

where \( {Z}_i^{\nu } \) is the number of valence electrons in the ith atom, \( h_i \) is the number of hydrogen atoms connected to the ith atom, and \( Z_i \) is the number of all electrons in the ith atom.

Replacing \( {\delta}_i \) by \( {\delta}_i^{\nu } \) defines the valence molecular connectivity index \( {}^m\chi_k^{\nu } \).
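As a minimal sketch of Eq. 2, the valence connectivity degree can be computed from three integers per atom. The function name `valence_delta` and the three example atoms are illustrative choices, not part of the original study.

```python
def valence_delta(valence_electrons, hydrogens, total_electrons):
    """Valence connectivity degree of one atom according to Eq. 2."""
    return (valence_electrons - hydrogens) / (
        total_electrons - valence_electrons - 1
    )


# Illustrative atoms: (valence electrons, attached hydrogens, all electrons)
examples = {
    "methylene carbon (CH2)": (4, 2, 6),
    "quaternary nitrogen (no H)": (5, 0, 7),
    "hydroxyl oxygen (OH)": (6, 1, 8),
}

for label, (z_v, h, z) in examples.items():
    print(label, valence_delta(z_v, h, z))
```

For a methylene carbon the valence degree equals the simple connectivity degree (2), whereas the nitrogen and oxygen atoms in this example both obtain a value of 5, which is how the valence indices distinguish heteroatoms from carbon.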

An example of calculations of molecular connectivity indices for one of the investigated gemini surfactants is presented in Appendix 1.

The molecular connectivity indices contain considerable information about the molecule. The low-order molecular connectivity indices include information about atoms and molecular size, while the cluster and path/cluster molecular connectivity indices include structural information about branch points and the branch point environment; the valence indices add information about heteroatoms [10, 22]. For example, the \( {}^0\chi^{\nu } \) index includes information about the heteroatoms contained in the molecule; the \( {}^1\chi \) and \( {}^1\chi^{\nu } \) indices contain information about molecular volume and molecular surface area, and \( {}^1\chi^{\nu } \) additionally adds information about heteroatoms; the \( {}^3\chi_c^{\nu } \) index contains information about the number of branches and their heteroatoms [10].

Statistics

The least squares method was used to generate the formula expressing the relationship between the logcmc and the molecular connectivity indices. In order to test the quality of the derived equation, four statistical parameters were used: the coefficient of determination (\( r^2 \)), the correlation coefficient (r), the Fisher ratio (F), and the standard deviation (s). The best relationship is the one with the highest possible values of \( r^2 \), r, and F and, simultaneously, the lowest value of s.

In the case of the simple linear least-squares model, the values of statistical parameters may be calculated using the following formulas [10]:

- the coefficient of determination:

$$ r^2=\frac{{\displaystyle \sum {\left({y}_i\left(\mathrm{cal}\right)-\overline{y}\right)}^2}}{{\displaystyle \sum {\left({y}_i\left(\mathrm{exp}\right)-\overline{y}\right)}^2}} $$
(3)

where \( {y}_i\left(\mathrm{exp}\right) \) is the experimental value of the property, \( {y}_i\left(\mathrm{cal}\right) \) is the calculated value of the property, and \( \overline{y}=\frac{1}{n}{\displaystyle \sum_{i=1}^n{y}_i} \),

- the correlation coefficient (r) can be obtained from Eq. (3) as the square root of the coefficient of determination. Note that this definition of r, in agreement with ref. [10], does not correspond to the standard definition of Pearson’s linear correlation coefficient, although it has a similar meaning.

- the Fisher ratio:

$$ F=\left(n-2\right)\cdot \frac{r^2}{1-{r}^2} $$
(4)

- the standard deviation of the fit:

$$ s=\sqrt{\frac{{\displaystyle \sum {\left({y}_i\left(\mathrm{exp}\right)-{y}_i\left(\mathrm{cal}\right)\right)}^2}}{n-2}} $$
(5)

where n is the number of compounds in the data set,

- the residual for compound i:

$$ {\varDelta}_i={y}_i\left(\mathrm{exp}\right)-{y}_i\left(\mathrm{cal}\right) $$
(6)
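Equations 3–6 can be sketched in a few lines of Python for a one-descriptor model. The function names and the five data points are made-up illustrations, not values from Table 1, and the mean in Eq. 3 is assumed here to be the mean of the experimental values.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_statistics(y_exp, y_cal):
    """r^2, r, F, s and residuals as defined in Eqs. 3-6."""
    n = len(y_exp)
    y_mean = sum(y_exp) / n  # assumption: mean of the experimental values
    r2 = sum((c - y_mean) ** 2 for c in y_cal) / sum(
        (e - y_mean) ** 2 for e in y_exp
    )
    residuals = [e - c for e, c in zip(y_exp, y_cal)]   # Eq. 6
    s = sqrt(sum(d ** 2 for d in residuals) / (n - 2))  # Eq. 5
    f = (n - 2) * r2 / (1 - r2)                         # Eq. 4
    return {"r2": r2, "r": sqrt(r2), "F": f, "s": s, "residuals": residuals}


# Toy data (hypothetical descriptor values and property values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_exp = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y_exp)
y_cal = [intercept + slope * xi for xi in x]
stats = fit_statistics(y_exp, y_cal)
print(intercept, slope, stats)
```

Because the fit includes an intercept, the residuals sum to zero and the ratio in Eq. 3 stays between 0 and 1.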

Results and discussion

The values of the molecular connectivity indices along with the experimental logcmc values for the training set are listed in Table 1.

Table 1 Experimental logcmc values [17–19] and values of molecular connectivity indices

Based on the data contained in Table 1, the correlation formulas containing one index were derived (step 1). All values of the statistical parameters for the relationships obtained in the first step are listed in Table 2.

Table 2 Values of statistical parameters for step 1

As follows from Table 2, the highest values of r and F and the lowest value of s are obtained for the relationship containing the second-order molecular connectivity index (\( {}^2\chi \)). Inspection of the data contained in Table 2 suggests that the \( {}^0\chi \), \( {}^1\chi \), and \( {}^0\chi^{\nu } \) indices also correlate well with the logcmc values. Table 3 shows the correlations between all the indices appearing in Table 1.

Table 3 Correlation matrix

Two indices with r ≥ 0.97 are highly correlated, those with 0.90 ≤ r < 0.97 are appreciably correlated, those with 0.50 ≤ r < 0.89 are weakly correlated, and those with r < 0.50 are not correlated. As follows from the correlation matrix, there are 12 pairs of highly correlated indices, among them the pair \( {}^0\chi \) and \( {}^2\chi \), with a correlation coefficient of 0.997. The \( {}^0\chi \) and \( {}^2\chi \) indices carry similar structural information related to changes of molecular structure, i.e., the values of these indices increase with the number of atoms and branches in the molecule [10]; therefore, we can ignore the \( {}^0\chi \) index in further considerations. The remaining indices that correlate highly with the logcmc values, namely \( {}^2\chi \), \( {}^1\chi \), and \( {}^0\chi^{\nu } \), also correlate highly with each other (Table 3), but they contain somewhat different structural information; in particular, their values respond differently to changes in molecular structure. The first-order molecular connectivity index (\( {}^1\chi \)) decreases as the number of branches increases, but the second-order molecular connectivity index (\( {}^2\chi \)) increases as the number of branches increases, whereas the valence molecular connectivity index of zero order (\( {}^0\chi^{\nu } \)) encodes information about heteroatoms [10, 22]. Thus, we keep these indices in the further considerations. These indices define models 1–3 in the first step. To each of these indices, the remaining indices were added one at a time (step 2). The values of the correlation coefficients for this second step are listed in Table 4.
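The correlation classes applied to Table 3 can be sketched as a small screening routine. The function names and the three descriptor columns are hypothetical. Two assumptions are made because the text does not settle them: the absolute value of r is classified, and values in the unassigned interval 0.89 ≤ r < 0.90 are treated as weakly correlated.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def classify(r):
    """Map a correlation coefficient to the classes used for Table 3."""
    r = abs(r)  # assumption: the sign of r is ignored
    if r >= 0.97:
        return "highly correlated"
    if r >= 0.90:
        return "appreciably correlated"
    if r >= 0.50:  # assumption: 0.89 <= r < 0.90 is counted as weak
        return "weakly correlated"
    return "not correlated"


# Hypothetical descriptor columns (not the indices of Table 1).
descriptors = {
    "index_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "index_b": [2.1, 3.9, 6.2, 7.8, 10.1],
    "index_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

for (name_1, col_1), (name_2, col_2) in combinations(descriptors.items(), 2):
    r = pearson(col_1, col_2)
    print(name_1, name_2, round(r, 3), classify(r))
```

A routine of this kind flags descriptor pairs that carry largely redundant information, which is the reason given above for dropping one index of a highly correlated pair.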

Table 4 Values of correlation coefficients for models 1–3 in step 2

Because the \( {}^2\chi \) index alone gives r = 0.982, the relationships with the pair of indices \( {}^1\chi \) and \( {}^2\chi \) (r = 0.982) and with the pair \( {}^0\chi^{\nu } \) and \( {}^2\chi \) (r = 0.982) can be ignored in further investigations. Next, from Table 4, it follows that for models 1 and 3 the values of the correlation coefficients did not change significantly, so for those models the stepwise process was ended. In the case of model 2, the values of the correlation coefficients are higher for the relationships that additionally contain the \( {}^3\chi_c \) or \( {}^3\chi_c^{\nu } \) index. The \( {}^3\chi_c \) index encodes information about the number of branches and their environment [10, 22]. The \( {}^3\chi_c^{\nu } \) index adds information about heteroatoms. Thus, the relationship containing the \( {}^3\chi_c^{\nu } \) index is richer in structural information than the one containing the \( {}^3\chi_c \) index. Furthermore, the addition of other indices (step 3) did not significantly change the values of the correlation coefficients; therefore, model 2 is now defined by the pair of indices \( {}^1\chi \) and \( {}^3\chi_c^{\nu } \).

The obtained formulas (models 1–3) are given below:

$$ \mathrm{Model}\ 1: \log cmc=-0.17261-0.184411\cdot {}^2\chi $$
(7)
$$ \mathrm{Model}\ 2: \log cmc=0.18447-0.14866\cdot {}^1\chi -0.19248\cdot {}^3\chi_c^{\nu } $$
(8)
$$ \mathrm{Model}\ 3: \log cmc=0.44266-0.12317\cdot {}^0\chi^{\nu } $$
(9)
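Once the connectivity indices of a compound are known, Eqs. 7–9 can be evaluated directly. The sketch below stores the published coefficients in a dictionary; the key names (`chi2`, `chi1`, `chi3c_v`, `chi0_v`) and the index values in the usage lines are hypothetical placeholders, not entries from Table 1.

```python
# Intercept and coefficients of Eqs. 7-9 (models 1-3).
MODELS = {
    1: (-0.17261, {"chi2": -0.184411}),
    2: (0.18447, {"chi1": -0.14866, "chi3c_v": -0.19248}),
    3: (0.44266, {"chi0_v": -0.12317}),
}


def log_cmc(model, indices):
    """Return logcmc for one compound from its connectivity indices."""
    intercept, coefficients = MODELS[model]
    return intercept + sum(
        coefficient * indices[name]
        for name, coefficient in coefficients.items()
    )


# Hypothetical index values for a single compound.
compound = {"chi2": 20.0, "chi1": 20.0, "chi3c_v": 2.0, "chi0_v": 30.0}

for model in MODELS:
    print(model, round(log_cmc(model, compound), 5))
```

All coefficients are negative, so any increase in one of these indices lowers the calculated logcmc.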

The statistical characteristics of the descriptors included in models 1–3 are shown in Appendix 2.

The plots of the experimental logcmc versus the logcmc calculated using Eqs. 7–9 are presented in Figs. 2, 3, and 4.

Fig. 2 Plot of the experimental logcmc versus that calculated using Eq. 7 for the training set (rhomb) (r = 0.982, F = 563.629, s = 0.095) and test compounds (triangle)

Fig. 3 Plot of the experimental logcmc versus that calculated using Eq. 8 for the training set (rhomb) (r = 0.983, F = 585.435, s = 0.093) and test compounds (triangle)

Fig. 4 Plot of the experimental logcmc versus that calculated using Eq. 9 for the training set (rhomb) (r = 0.975, F = 403.639, s = 0.111) and test compounds (triangle)

The comparisons of the experimental logcmc with the values calculated using Eqs. 7–9, presented in Figs. 2, 3, and 4, show that models 1–3 estimate the logcmc of the compounds from the training set very well, and that model 2 is slightly better than model 1 and better than model 3. The values of the coefficients of determination are equal to 0.964, 0.966, and 0.951 for models 1, 2, and 3, respectively.
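The quoted coefficients of determination follow from the correlation coefficients given in the captions of Figs. 2–4, since r is defined as the square root of the coefficient of determination. A minimal check, with variable names chosen only for illustration:

```python
# Correlation coefficients reported for models 1-3 (Figs. 2-4).
r_values = {1: 0.982, 2: 0.983, 3: 0.975}

# Coefficient of determination as the square of r, rounded to three decimals.
r_squared = {model: round(r * r, 3) for model, r in r_values.items()}
print(r_squared)
```

The rounded squares reproduce the three values stated in the text.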

The plots of residuals versus the experimental values of logcmc are shown in Figs. 5, 6, and 7.

Fig. 5 Plot of residuals versus the experimental logcmc values for the training set (rhomb) and test compounds (triangle) (model 1)

Fig. 6 Plot of residuals versus the experimental logcmc values for the training set (rhomb) and test compounds (triangle) (model 2)

Fig. 7 Plot of residuals versus the experimental logcmc values for the training set (rhomb) and test compounds (triangle) (model 3)

The examination of the residuals (Figs. 5, 6, and 7) shows generally good agreement between the experimental and calculated values of logcmc. Most of the residuals are close to zero and only one residual for model 1 is slightly larger than 2s.

The obtained models were used to estimate the logcmc values of other compounds, different from the gemini surfactants of the training set. The literature logcmc values for the test compounds are listed in Table 5.

Table 5 Experimental logcmc values [20, 21] of test compounds

The comparison of the experimental logcmc values of the test compounds with the values estimated using Eqs. 7–9 is shown in Figs. 2, 3, and 4. The agreement between the predicted and experimental logcmc values of the test compounds is very good. The plots of residuals (Figs. 5, 6, and 7) confirm this agreement.

In brief, the best model in the first step is the one containing the second-order molecular connectivity index (\( {}^2\chi \)) (model 1). The second step shows that the relationship containing the first-order molecular connectivity index (\( {}^1\chi \)) and the third-order cluster valence molecular connectivity index (\( {}^3\chi_c^{\nu } \)) (model 2) estimates the values of the critical micelle concentration of cationic (chloride) gemini surfactants slightly better.

The second-order molecular connectivity index (\( {}^2\chi \)) appearing in model 1 does not differentiate heteroatoms; it represents two-bond terms within the molecule, and its values depend on the isomers of the compound [10]. The values of the \( {}^2\chi \) index increase with increasing length and branching of the hydrocarbon chains. The zeroth-order valence molecular connectivity index (\( {}^0\chi^{\nu } \)) appearing in model 3 relates to the atoms of the molecule, and it differentiates heteroatoms. The values of the \( {}^0\chi^{\nu } \) index increase with increasing length and branching of the hydrocarbon chains, and they are smaller for compounds containing heteroatoms than for their hydrocarbon analogues. The first-order molecular connectivity index (\( {}^1\chi \)) appearing in model 2 does not differentiate heteroatoms; it represents the one-bond terms within the molecule. The values of the \( {}^1\chi \) index depend on the isomers of the compound: they decrease with increasing branching but increase with increasing length of the hydrocarbon chains. The third-order cluster valence molecular connectivity index (\( {}^3\chi_c^{\nu } \)) appearing in model 2 represents three-bond cluster terms within the molecule, and it differentiates heteroatoms. The values of the \( {}^3\chi_c^{\nu } \) index increase with increasing branching of the hydrocarbon chains, and they are smaller for compounds containing heteroatoms than for their hydrocarbon analogues. In all models, the molecular connectivity indices have negative coefficients; thus, as their values increase, the cmc decreases.

So, from Eqs. (7–9) and also from Table 1, it follows that as the number of methylene groups in the hydrocarbon chains increases, the cmc decreases. For example, for compound bis(EA-m-3iso)C6 (m = 9, 11), the experimental values of cmc are 0.084 and 0.059 mM [], and the calculated values of cmc are 0.101 and 0.055 mM (model 1), 0.113 and 0.057 mM (model 2), and 0.125 and 0.056 mM (model 3). Likewise, as the number of methylene groups in the spacer group increases, both the experimental and the calculated values of the cmc decrease. For example, for compound AC12CnC12A (n = 6, 12), the experimental values of cmc are 0.340 and 0.160 mM [18], and the calculated values of cmc are 0.401 and 0.163 mM (model 1), 0.399 and 0.143 mM (model 2), and 0.430 and 0.129 mM (model 3). In the case of compounds bis(EA-m-3)R and bis(EA-m-3iso)R for R = 5, 6, 8, both the experimental and the calculated values of cmc also decrease with increasing length of the alkyl chain at the central nitrogen atom of the molecule. Thus, an increase in the length of the hydrocarbon chain, and simultaneously in the flexibility of this chain, results in a decrease of the cmc values.

The comparison of the compounds with straight and branched chains shows that the models differ in how branching influences the calculated cmc values. For example, for compounds bis(EA-11-3)C8 and bis(EA-11-3iso)C8, model 1 gives cmc values of 0.043 and 0.041 mM and model 3 gives 0.041 and 0.038 mM, whereas model 2 gives 0.040 and 0.041 mM, respectively. The experimental cmc values are 0.041 mM for compound bis(EA-11-3)C8 and 0.054 mM for compound bis(EA-11-3iso)C8 [17]. Thus, the experimental value of cmc is higher for the branched compound bis(EA-11-3iso)C8, and of the three models only model 2 reproduces this trend. Some other gemini surfactants and the corresponding calculated values of logcmc are presented in Appendix 3. For the compounds presented in Appendix 3, the cmc values calculated using models 1 and 2 are smaller for the compounds with branched chains than for those with straight chains and the same number of atoms. Using model 3, the cmc values are smaller only for compounds with branched carbon chains, whereas for compounds containing heteroatoms the branches lead to higher cmc values. The result obtained for the compounds containing heteroatoms is in agreement with experiment [23], namely with the values for the chloride compounds C12EO1C12 (0.5 mM at 20 °C) and C12C4(OH)C12 (0.65 mM at 20 °C).
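The branching comparison for bis(EA-11-3)C8 and bis(EA-11-3iso)C8 can be restated as a simple direction check on the cmc values quoted above (in mM). The variable and function names are illustrative only.

```python
# (straight chain, branched chain) cmc values in mM, as quoted in the text.
experimental = (0.041, 0.054)
calculated = {
    "model 1": (0.043, 0.041),
    "model 2": (0.040, 0.041),
    "model 3": (0.041, 0.038),
}


def branching_raises_cmc(pair):
    """True if the branched isomer has the higher cmc."""
    straight, branched = pair
    return branched > straight


# A model agrees with experiment if it predicts the same direction of change.
agreement = {
    name: branching_raises_cmc(pair) == branching_raises_cmc(experimental)
    for name, pair in calculated.items()
}
print(agreement)
```

Only model 2 predicts a higher cmc for the branched isomer, although the size of the predicted increase (0.001 mM) is much smaller than the experimental one (0.013 mM).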

The comparison of the heteroatom-containing compounds with their hydrocarbon analogues (Appendix 3) shows that, for model 3, the presence of heteroatoms in the molecule results in higher calculated values of the critical micelle concentration than for the hydrocarbon analogues. Model 2 differentiates heteroatoms only on branches, and model 1 does not differentiate heteroatoms at all. Some experimental results show higher values of the critical micelle concentration for gemini surfactants containing heteroatoms than for their hydrocarbon analogues [18, 24–26]. This is the case, for example, for the bromide compounds C12EO2C12 (1.09 mM [24]) and C12C8C12 (0.84 mM [25]) and also for C127NHC12 (1.17 mM [26] and 1.21 mM [18]) and C12C7C12 (0.9 mM [26]). The theoretical results obtained for cationic (bromide) gemini surfactants that differ only in the spacer group [13] also show that the presence of heteroatoms in the spacer group results in a higher value of cmc. Thus, model 3 better describes the effect of heteroatoms on cmc values.

In brief, the investigated models (models 1–3) show high correlations between logcmc and the molecular connectivity indices, and the statistically best models (models 1–2) can be used to estimate the values of the critical micelle concentration, but these models differ in how they describe the effect of the structure of the investigated compounds on cmc values. All models describe the cmc values very well if only the elongation of alkyl chains is taken into account. In the case of branches and heteroatoms, the models describe the cmc values differently, and some results differ from the experimental ones. This suggests that another index would better describe the effect of the structure on the critical micelle concentration of cationic (chloride) gemini surfactants. Because some experimental data show that branched chains, especially branched hydrocarbon chains [17, 23, 27], and also heteroatoms [24–26] lead to higher cmc values, the index that satisfactorily describes the effect of the chemical structure on the cmc value is the first-order valence molecular connectivity index (\( {}^1\chi^{\nu } \)). This index is similar to the first-order molecular connectivity index (\( {}^1\chi \)), but it includes heteroatom information. The values of the \( {}^1\chi^{\nu } \) index increase with increasing length of the hydrocarbon chains and decrease with increasing branching. This index differentiates heteroatoms, and its values are smaller than the values of the \( {}^1\chi \) index.

The formula containing the 1 χ ν index is the following:

$$ \mathrm{Model}\ 4: \log cmc=0.56045-0.20443\cdot {}^1\chi^{\nu } $$
(10)

The statistical characteristic of the selected descriptor is given in Appendix 2.

From Eq. 10, it follows that as the number of methylene groups in the hydrocarbon chains and in the spacer chain increases, the cmc decreases. For example, for compound bis(EA-m-3iso)C8 (m = 9, 11), the experimental values of cmc are 0.078 and 0.054 mM [17], and the calculated values of cmc are 0.104 and 0.040 mM. For compounds with different spacer lengths, AC12CnC12A (n = 6, 12), the experimental values of cmc are 0.340 and 0.160 mM [18], and the calculated values of cmc are 0.414 and 0.101 mM, respectively. The comparison of the compounds with straight and branched chains shows that the calculated values are also in good agreement with the experimental results. Examples are the compounds bis(EA-11-3)C8 and bis(EA-11-3iso)C8, for which the experimental cmc values are 0.041 and 0.054 mM and the cmc values calculated using model 4 are 0.038 and 0.040 mM, respectively.
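The chain-length sensitivity implied by Eq. 10 can be sketched as follows. Two assumptions are made that the text does not state: each methylene group inserted into an unbranched chain segment raises the first-order valence index by 0.5 (one extra bond term between two carbons of degree 2), and the change from m = 9 to m = 11 in bis(EA-m-3iso)C8 corresponds to four added methylene groups in total. The function names are illustrative.

```python
INTERCEPT = 0.56045  # Eq. 10
SLOPE = -0.20443     # Eq. 10, coefficient of the first-order valence index


def log_cmc_model4(chi1_v):
    """logcmc from the first-order valence connectivity index (Eq. 10)."""
    return INTERCEPT + SLOPE * chi1_v


def cmc_ratio(delta_chi1_v):
    """Factor by which the calculated cmc changes when the index changes."""
    return 10 ** (SLOPE * delta_chi1_v)


per_methylene = cmc_ratio(0.5)    # assumed +0.5 per inserted CH2
four_methylenes = cmc_ratio(2.0)  # assumed four CH2 groups for m = 9 -> 11

# Ratio of the calculated cmc values quoted above for m = 11 and m = 9.
reported_ratio = 0.040 / 0.104

print(round(per_methylene, 3), round(four_methylenes, 3), round(reported_ratio, 3))
```

Under these assumptions each added methylene group lowers the calculated cmc by roughly one fifth, and four of them give a factor close to the ratio of the two calculated values quoted above.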

The plot of the experimental logcmc versus the logcmc calculated using Eq. 10 and the plot of residuals versus the experimental values of logcmc for the training set and test compounds are shown in Figs. 8 and 9.

Fig. 8 Plot of the experimental logcmc versus that calculated using Eq. 10 for the training set (rhomb) (r = 0.957, F = 231.36, s = 0.144) and test compounds (triangle)

Fig. 9 Plot of residuals versus the experimental logcmc values for the training set (rhomb) and test compounds (triangle) (model 4)

The statistical parameters show that model 4 estimates the logcmc values of the investigated compounds less accurately than models 1–3, but a comparison of how the structural elements affect the experimental and calculated cmc values shows that the values of the critical micelle concentration calculated using model 4 are in good agreement with the experimental results. Some additional comparisons are presented in Appendix 3.

The data contained in Appendix 3 show that an increase in the number of atoms, whether by chain lengthening or by additional branches, decreases the cmc value calculated using models 1–4. For the compounds containing heteroatoms, the effect of branches is in good agreement with the experimental results obtained for bromide compounds [28]. In the case of the elongation of a hydrophilic spacer, however, as for compounds C12EOnC12, the experimental results [24] show the opposite behavior. This may be because the length of the hydrocarbon chains has the dominant effect on the cmc values and, in consequence, on the obtained models.

The experimental studies [18] also show that the cmc values of chloride gemini surfactants are higher than those of the bromide ones. Indeed, the experimental cmc values of the C12C6C12 gemini surfactant with bromides and chlorides as counterions are 0.89 and 1.30 mM [18], respectively. Using the previous model [12] for the bromide gemini surfactant and the present models (Eqs. 7–10) for the chloride one, we obtain the following calculated values of cmc: 1.11 mM [12] for the bromide and 1.70 mM (model 1), 1.56 mM (model 2), 1.43 mM (model 3), and 1.24 mM (model 4) for the chloride. So, both the experimental and the calculated values of cmc of cationic (chloride) gemini surfactants are higher than those of the bromide ones, and for the chloride surfactant the best estimate is given by model 4.
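The comparison for the C12C6C12 chloride surfactant amounts to ranking the four models by their absolute deviation from the experimental cmc. The sketch below uses only the values quoted above (in mM); the variable names are illustrative.

```python
experimental_mM = 1.30  # C12C6C12 with chloride counterions [18]
calculated_mM = {
    "model 1": 1.70,
    "model 2": 1.56,
    "model 3": 1.43,
    "model 4": 1.24,
}

# Absolute deviation of each calculated value from the experimental one.
abs_deviation = {
    name: abs(value - experimental_mM) for name, value in calculated_mM.items()
}
best_model = min(abs_deviation, key=abs_deviation.get)

for name, deviation in abs_deviation.items():
    print(name, round(deviation, 2))
print("closest estimate:", best_model)
```

The deviations are 0.40, 0.26, 0.13, and 0.06 mM for models 1–4, so model 4 gives the closest estimate for this compound.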

It is worth adding that the test compounds (Table 5) differ from the training set compounds in the structure of the spacer and head groups, yet for those molecules too the agreement between the predicted and experimental logcmc values is very good.

Conclusion

In the present work, cationic (chloride) gemini surfactants with various structures were taken into account. All the models obtained confirm the experimental result that the length of the alkyl chains plays the major role in micelle formation. The present study shows that although the second-order molecular connectivity index correlates highly with the logcmc values of cationic gemini surfactants, the statistically weaker correlation of logcmc with the first-order valence molecular connectivity index better describes the effect of branches and heteroatoms on the critical micelle concentration of cationic (chloride) gemini surfactants. Because model 4 (Eq. (10)) predicts the values for the investigated compounds well, it can be used to predict the critical micelle concentration and, in particular, to design new cationic (chloride) gemini surfactants that are more active in micelle formation.