Introduction

Toxicological assessment of phenolic compounds is essential for risk-assessment purposes. Compounds with a single aromatic ring substituted with a hydroxyl group (the phenols) are ubiquitous in nature and are used in many industries, including textiles, leather, paper, and oil. They are also common food additives and are frequently used in agriculture [1]. There has therefore been great interest in assessing the toxicity of such compounds. The potential hazard of untested chemicals, a challenge confronting national and international regulatory agencies [2–5], can be measured experimentally, but this approach is both expensive and time-consuming. The development of computational methods as an alternative tool for predicting the properties of chemicals has therefore been a subject of intensive study. Among computational methods, quantitative structure–activity relationships (QSAR) have found diverse applications for predicting compounds' properties, including biological activity [6], physical properties [7], and toxicity [8, 9]. QSPR/QSAR models are essentially calibration models in which the independent variables are molecular descriptors that describe the structure of molecules and the dependent variable is the property or activity of interest. In QSAR studies, model-construction techniques such as multiple linear regression (MLR) and artificial neural networks (ANN) have been used to examine linear and nonlinear relationships between the activity of interest and molecular descriptors. Artificial neural networks have become popular in QSPR/QSAR modeling because of their success where complex non-linear relationships exist in the data [10, 11]. An ANN is formed from artificial neurons connected by coefficients (weights), which constitute the neural structure and are organized in layers.
The layers of neurons between the input and output layers are called hidden layers. Neural networks do not need an explicit formulation of the mathematical or physical relationships of the problem handled. This gives ANNs an advantage over traditional fitting methods for some chemical applications. For these reasons, in recent years ANNs have been applied to a wide variety of chemical problems [12–20]. Application of these techniques usually requires selection of variables to build well-fitting models. Genetic algorithms (GA) are now well-known and widely used methods for variable selection [21–23]. GA are stochastic methods used to solve optimization problems defined by fitness criteria, by applying Darwin's evolution hypothesis and different genetic functions, i.e., crossover and mutation.

QSAR models have been used to predict the toxicity of phenols [1, 24, 25]. Two approaches have been suggested in this modeling and in similar QSAR modeling. The first is the development of “global” models, which are defined as QSAR models that cover a number of different mechanisms of action for a given toxicological endpoint. The use of the term “global model” in this study is distinct from that used to define QSAR models based on chemicals with similar modes of action allowing interspecies correlations. The second is the development of a number of “local” models, each covering a single mechanism of action present in the database [26]. Very recently, Enoch et al. [26] used a global QSAR method for prediction of the toxicity of phenols. The ability of the proposed global QSAR model to predict the toxicity of phenols is poor (r² of the model is 0.71 for the training set and 0.73 for the test set) [26].

In order to predict accurately the toxicity of these compounds, in this work genetic algorithm–multiparameter linear regression (GA-MLR) and genetic algorithm–artificial neural network (GA-ANN) global models were used to generate QSAR models between the descriptors and toxicity of 250 phenols with diverse chemical structures. The results obtained were compared with each other, with those from previous work [26], and with the experimental values.

Results and discussion

The genetic algorithm technique was used to select the most important descriptors. To find the optimum number of descriptors, models containing from one to ten descriptors were built and compared.

The R² value can generally be increased by adding predictor variables to the model, even if an added variable does not contribute to reducing the unexplained variance of the dependent variable. The use of R² therefore requires special attention. For this reason, it is better to use another statistical parameter, the adjusted R² (R²adj), where R²adj is defined by Eq. 1, n is the number of observations, and p is the number of predictor variables in the model.

$$ R^{2}_{\text{adj}} = 1- \left( { 1- R^{2} } \right)\left( {{\frac{n - 1}{n - p - 1}}} \right) $$
(1)

R²adj is interpreted similarly to the R² value, but it also takes the number of degrees of freedom into account. It is adjusted by dividing the residual sum of squares and the total sum of squares by their respective degrees of freedom. The R²adj value diminishes if a variable added to the equation does not reduce the unexplained variance [27]. R²adj is therefore used to compare models with different numbers of predictor variables.
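As a minimal illustration of Eq. 1, the sketch below evaluates the adjusted R² for a given R², number of observations n, and number of predictors p. The function name `adjusted_r2` is hypothetical; the example call uses the training-set r² (0.747) and training-set size (150) reported later in this paper for the six-descriptor model, and the printed value is computed by the sketch rather than quoted from the study.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 (Eq. 1) for n observations and p predictor variables."""
    if n - p - 1 <= 0:
        raise ValueError("Eq. 1 requires n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Illustrative call: six descriptors, 150 training compounds, r^2 = 0.747.
print(round(adjusted_r2(0.747, n=150, p=6), 3))

# Adding a seventh descriptor that leaves R^2 unchanged lowers the adjusted value.
print(adjusted_r2(0.747, n=150, p=7) < adjusted_r2(0.747, n=150, p=6))
```

Because the penalty factor (n − 1)/(n − p − 1) grows with p, a descriptor that does not raise R² lowers R²adj, which is the behavior used to judge the optimum subset size.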

Another statistical parameter is the standard error of the estimate (s), which measures the dispersion of the observed values about the regression line. The lower the value of s, the more reliable the prediction. Figure 1 shows plots of R², R²adj, and s for the training set as a function of the number of descriptors for models with 1–10 descriptors. R² and R²adj increased with increasing number of descriptors, whereas the values of s decreased. Because models with 7–10 descriptors did not significantly improve the statistics, the optimum subset size was taken to be six descriptors.

Fig. 1 Influences of the number of descriptors on R² (filled circle), R²adj (open circle), and s (filled triangle) of the regression model

The selected variables and the correlation matrix of the descriptors are listed in Table 1. The correlation coefficient of each pair of descriptors was less than 0.65, which indicates that the selected descriptors are not strongly intercorrelated.

Table 1 Correlation coefficient matrix of the selected descriptors

To examine the relative importance, and the contribution of each descriptor in the model, for each descriptor the value of the mean effect (MF) was calculated. This calculation was performed by use of Eq. 2.

$$ {\text{MF}}_{j} = \frac{\beta_{j} \sum\nolimits_{i = 1}^{n} d_{ij}}{\sum\nolimits_{j = 1}^{m} \beta_{j} \sum\nolimits_{i = 1}^{n} d_{ij}} $$
(2)

MFⱼ represents the mean effect for the considered descriptor j, βⱼ is the coefficient of descriptor j, dᵢⱼ is the value of descriptor j for molecule i, n is the number of molecules, and m is the number of descriptors in the model. The MF value indicates the relative importance of a descriptor compared with the other descriptors in the model. Its sign shows the direction of variation in the toxicity values as a result of an increase (or reduction) of the descriptor values. The mean effect values are −0.043, 1.071, −0.081, 0.035, −0.004, and 0.023 for Xt, MATS1m, PJI3, Mor23u, nCs, and H-046, respectively. By interpreting the descriptors contained in the model, it is possible to gain useful chemical insights into the toxicity of phenols. For this reason, an interpretation of the QSAR results is provided below.
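The mean-effect calculation of Eq. 2 can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `mean_effects` and the toy coefficients and descriptor values are hypothetical.

```python
from typing import List, Sequence


def mean_effects(betas: Sequence[float],
                 D: Sequence[Sequence[float]]) -> List[float]:
    """Mean effect of each descriptor (Eq. 2).

    betas[j] is the regression coefficient of descriptor j and
    D[i][j] is the value of descriptor j for molecule i.
    """
    m = len(betas)
    col_sums = [sum(row[j] for row in D) for j in range(m)]
    weighted = [betas[j] * col_sums[j] for j in range(m)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("denominator of Eq. 2 is zero")
    return [w / total for w in weighted]


# Toy data (hypothetical, not taken from the study): 2 molecules, 2 descriptors.
betas = [2.0, -1.0]
D = [[1.0, 2.0],
     [3.0, 0.0]]
mf = mean_effects(betas, D)
print(mf)        # one value per descriptor
print(sum(mf))   # the mean effects sum to 1 by construction
```

By construction the mean effects sum to one, which agrees (to rounding) with the six values reported above.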

The first descriptor in the model is Xt (total structure connectivity index). Connectivity indices are among the most popular topological indices and are calculated from the vertex degree of the atoms in the H-depleted molecular graph. Xt is a connectivity index that accounts simultaneously for all the atoms in the graph. The total structure connectivity index is also the square root of the simple topological index that was proposed for measuring molecular branching [28]. The mean effect of Xt has a negative sign, which indicates that an increase in molecular branching leads to a decrease in the pIG₅₀ value.

The second descriptor is MATS1m (Moran autocorrelation of lag 1, weighted by atomic masses), which is a 2D autocorrelation descriptor. In this descriptor the Moran coefficient is a distance-type function, and the weighting can be any physicochemical property calculated for each atom of the molecule, for example atomic mass or polarizability. The atoms of the molecule thus represent a set of discrete points in space, and the atomic property is the function evaluated at those points. The Moran coefficient usually takes a value in the interval [−1, +1]; positive autocorrelation corresponds to positive values of the coefficient, whereas negative autocorrelation produces negative values. The physicochemical property in this case is the atomic mass. MATS1m has a positive mean effect that is much larger than that of the other descriptors, which indicates that this descriptor has a significant effect on the toxicity and that the pIG₅₀ value is directly related to it. Hence, it was concluded that increasing the molecular mass increases the value of this descriptor, causing an increase in pIG₅₀.

The third descriptor is PJI3 (3D Petitjean shape index), which is a geometrical descriptor. The Petitjean shape index is a topological anisometry descriptor, also called a graph-theoretical shape coefficient, that is calculated from the topological radius and the topological diameter obtained from the distance matrix representing the considered molecular graph. PJI3 has a negative sign, which indicates that pIG₅₀ is inversely related to this descriptor.

Mor23u is the fourth descriptor appearing in the model. It is a 3D-MoRSE descriptor. 3D-MoRSE descriptors (3D molecule representation of structures based on electron diffraction) are derived from infrared spectra simulation using a generalized scattering function [28]. Mor23u denotes signal 23, unweighted. Its mean effect has a positive sign, which indicates that pIG₅₀ is directly related to this descriptor.

The fifth descriptor is nCs, which is one of the functional group count descriptors. nCs represents the total number of secondary C(sp3) atoms. The mean effect of nCs has a negative sign, which indicates that an increase in the number of secondary C(sp3) atoms in the molecule leads to a decrease in its pIG₅₀ value.

The final descriptor of the model is H-046 (H attached to C0(sp3)). It is one of the atom-centered fragment descriptors, which describe each atom by its own atom type and by the bond types and atom types of its first neighbors. This descriptor represents the first neighbor (hydrogen) of carbon atoms. Its mean effect has a positive sign, which indicates that pIG₅₀ is directly related to this descriptor.

In summary, it is concluded that the molecular branching, the molecular mass, the molecular shape, the number of secondary C(sp3) of molecules, and the first neighbor (hydrogen) of carbon atoms are of major importance in the toxicity of the compounds studied.

Genetic algorithm: multiparameter linear regression

We used a GA to select the most relevant descriptors. Multiparameter linear regression of the pIG₅₀ values of the 150 phenolic compounds in the training set on the six descriptors selected by the GA gave the following equation:

$$ \begin{aligned} pIG_{50} = & - 15.05\left( \pm 1.66 \right) - 15.77\left( \pm 2.00 \right){\text{Xt}} + 17.84\left( \pm 1.58 \right){\text{MATS1m}} \\ & - 1.84\left( \pm 0.31 \right){\text{PJI3}} - 1.23\left( \pm 0.18 \right){\text{Mor23u}} - 0.12\left( \pm 0.04 \right){\text{nCs}} \\ & + 0.14\left( \pm 0.01 \right){\text{H-046}} \end{aligned} $$
(3)

The model was then used to predict pIG₅₀ values for the compounds in the validation and prediction sets. The prediction results are given in Table 2. The calculated values of pIG₅₀ for the compounds in the training, validation, and prediction sets using the GA-MLR model have also been plotted versus their experimental values (Fig. 2). The squared correlation coefficients, r², obtained were 0.747 for the training set, 0.721 for the validation set, and 0.516 for the prediction set. Table 3 shows the root mean square error (RMSE) and r² of the model for the total, training, validation, and prediction sets.

Table 2 Experimental values of the toxicity of phenols to Tetrahymena pyriformis (pIG₅₀) and the values calculated by the GA-MLR and GA-ANN global models

Fig. 2 Plot of the calculated values of pIG₅₀ from the GA-MLR model versus the experimental values for the training (open circle), validation (filled circle), and prediction (filled triangle) sets

Table 3 Comparison of statistical data obtained by the GA-MLR and GA-ANN models for the toxicity (pIG₅₀) of phenols

The model obtained was validated using the leave-one-out (LOO) and leave-group-out (LGO) cross-validation processes. For LOO cross-validation, a data point is removed from the set and the model is recalculated. The predicted activity for that point is then compared with its actual value. This is repeated until each data point has been omitted once. For LGO, 20% of the data points are removed from the dataset and the model is refitted; the values predicted for those points are then compared with the experimental values. Again, this is repeated until each data point has been omitted once. The cross-validated correlation coefficient (Q²) was 0.620 for LGO and 0.728 for LOO. Because both procedures reuse the training data, these values indicate that the regression model has good internal predictive power.
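The LOO and LGO procedures described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the study: the helper names (`fit_ols`, `predict`, `q2_cv`) are hypothetical, groups are formed by interleaving rather than by random assignment, Q² is taken as 1 − PRESS/SS_tot with the overall mean as reference, and the toy data are invented.

```python
from typing import List, Sequence


def fit_ols(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    A = [[1.0] + list(row) for row in X]              # prepend intercept column
    k = len(A[0])
    # Augmented matrix (A^T A | A^T y)
    M = [[sum(r[i] * r[j] for r in A) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(A, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]


def predict(coef: Sequence[float], row: Sequence[float]) -> float:
    return coef[0] + sum(b * x for b, x in zip(coef[1:], row))


def q2_cv(X, y, n_groups: int) -> float:
    """Cross-validated Q^2. n_groups == len(y) gives LOO; n_groups == 5
    leaves out 20% of the points at a time (LGO)."""
    n = len(y)
    press = 0.0
    for g in range(n_groups):
        held = set(range(g, n, n_groups))             # each point is omitted once
        X_train = [X[i] for i in range(n) if i not in held]
        y_train = [y[i] for i in range(n) if i not in held]
        coef = fit_ols(X_train, y_train)
        press += sum((y[i] - predict(coef, X[i])) ** 2 for i in held)
    y_mean = sum(y) / n
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / ss_tot


# Toy data (hypothetical): one descriptor, nearly linear response.
X = [[float(x)] for x in range(10)]
y = [1.0 + 2.0 * x + (0.1 if x % 2 else -0.1) for x in range(10)]
print("Q2 (LOO):", q2_cv(X, y, n_groups=len(y)))
print("Q2 (LGO, 20%):", q2_cv(X, y, n_groups=5))
```

In both cases every compound is predicted exactly once by a model that was fitted without it, so Q² measures how well the model reproduces data it has not seen during fitting.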

Genetic algorithm–artificial neural network

To process the non-linear relationships between the toxicity and the descriptors, the ANN modeling method combined with GA for feature selection was employed. The input vectors were the set of descriptors selected by the GA, and the number of nodes in the input layer was therefore equal to the number of selected descriptors. In the GA-MLR model it is assumed that the descriptors are independent of each other and have truly additive relevance to the property under study. ANNs are particularly well-suited for QSAR/QSPR models because of their ability to extract non-linear information present in the data matrix. For this reason the next step in this work was generation of the ANN model. There are no rigorous theoretical principles for choosing the proper network topology, so different structures were tested in order to obtain the optimum number of hidden neurons and training cycles [17–20]. Before training the network, the number of nodes in the hidden layer was optimized by conducting several training sessions with different numbers of hidden nodes (from 1 to 18). The root mean square errors of the training (RMSET) and validation (RMSEV) sets were obtained at various iterations for each number of hidden neurons, and the number of neurons giving the minimum value of RMSEV was taken as the optimum. A plot of RMSET and RMSEV versus the number of nodes in the hidden layer is shown in Fig. 3. It is clear that fifteen nodes in the hidden layer is the optimum value.

Fig. 3 Plot of RMSE for training (open circles) and validation (filled circles) sets versus the number of nodes in the hidden layer

The network consists of six inputs (the same descriptors as in the GA-MLR model), fifteen hidden nodes, and one output for pIG₅₀, giving an ANN with a 6-15-1 architecture. Training of the network is stopped when the RMSEV starts to increase, i.e., when overtraining begins, because overtraining causes the ANN to lose its prediction power [11]. To control overtraining during the training procedure, the values of RMSET and RMSEV were calculated and recorded to monitor the extent of learning at each iteration. The results showed that overtraining did not occur in the optimum architecture (Fig. 4).
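The stopping rule described above, in which training ends once the validation error starts to rise, can be sketched as a scan over a recorded RMSEV curve. The function name `early_stop_iteration`, the `patience` parameter, and the error values are hypothetical and serve only to illustrate the idea.

```python
from typing import Sequence


def early_stop_iteration(rmse_val: Sequence[float], patience: int = 1) -> int:
    """Return the index of the iteration whose weights should be kept:
    the best validation RMSE seen before `patience` consecutive
    non-improving iterations occur."""
    best_i, best = 0, rmse_val[0]
    waited = 0
    for i in range(1, len(rmse_val)):
        if rmse_val[i] < best:
            best_i, best, waited = i, rmse_val[i], 0
        else:
            waited += 1
            if waited >= patience:
                break                      # overtraining is assumed to begin here
    return best_i


# Hypothetical validation curve: falls, then rises once overtraining sets in.
rmsev = [0.60, 0.41, 0.30, 0.26, 0.27, 0.29]
print(early_stop_iteration(rmsev))  # → 3
```

A `patience` greater than 1 tolerates small fluctuations in RMSEV before training is stopped; whether the original training used such a tolerance is not stated in the text.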

Fig. 4 Plot of RMSE for training (open circles) and validation (filled circles) sets versus the number of iterations

The generated ANN was then trained using the training and validation sets for optimization of the weights and biases. To evaluate the predictive power of the generated ANN, the optimized network was used to predict the pIG₅₀ values of the compounds in the prediction set, which were not used in the modeling procedure (Table 2). The calculated values of pIG₅₀ for the compounds in the training, validation, and prediction sets using the ANN model have been plotted versus their experimental values in Fig. 5. A plot of the residuals for the calculated values of pIG₅₀ in the training, validation, and prediction sets versus their experimental values is presented in Fig. 6. As can be seen, the model did not show proportional or systematic error, because the residuals are distributed randomly on both sides of zero.

Fig. 5 Plot of the calculated values of pIG₅₀ from the GA-ANN model versus their experimental values for the training (open circles), validation (filled circles), and prediction (filled triangles) sets

Fig. 6 Plot of the residuals for calculated values of pIG₅₀ from the GA-ANN model versus their experimental values for the training (open circles), validation (filled circles), and prediction (filled triangles) sets

As expected, the calculated values of pIG₅₀ are in good agreement with the experimental values. The correlation equation for all of the calculated values of pIG₅₀ from the ANN model and the experimental values is given by Eq. 4.

$$ \begin{aligned} & pIG_{50} \left( {\text{cal}} \right) = 0.927\,pIG_{50} \left( \exp \right) + 0.054 \\ & \left( r^{2} = 0.929;{\text{ RMSE}} = 0.220;\;F = 3257.523 \right) \end{aligned} $$
(4)

Similarly, the correlation of pIG₅₀ (cal) versus pIG₅₀ (exp) values in the prediction set is given by Eq. 5.

$$ \begin{aligned} & pIG_{50} \left( {\text{cal}} \right) = 0.927\,pIG_{50} \left( \exp \right) + 0.079 \\ & \left( r^{2} = 0.926;{\text{ RMSE}} = 0.224;\;F = 599.075 \right) \end{aligned} $$
(5)
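Statistics of the kind quoted with Eqs. 4 and 5 can be obtained from paired experimental and calculated values as sketched below. The function name `calc_vs_exp_stats` and the toy values are hypothetical. The F statistic is computed here as r²(n − 2)/(1 − r²), the standard form for a one-predictor regression; that the reported F values were obtained in exactly this way is an assumption.

```python
from math import sqrt
from typing import Dict, Sequence


def calc_vs_exp_stats(p_exp: Sequence[float],
                      p_cal: Sequence[float]) -> Dict[str, float]:
    """Slope, intercept, r^2, RMSE, and F for the line p_cal = a * p_exp + b."""
    n = len(p_exp)
    mx = sum(p_exp) / n
    my = sum(p_cal) / n
    sxx = sum((x - mx) ** 2 for x in p_exp)
    syy = sum((y - my) ** 2 for y in p_cal)
    sxy = sum((x - mx) * (y - my) for x, y in zip(p_exp, p_cal))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)
    rmse = sqrt(sum((x - y) ** 2 for x, y in zip(p_exp, p_cal)) / n)
    f_stat = r2 / (1.0 - r2) * (n - 2)     # undefined for a perfect fit (r2 == 1)
    return {"slope": slope, "intercept": intercept,
            "r2": r2, "rmse": rmse, "F": f_stat}


# Toy values (hypothetical, not taken from Table 2).
p_exp = [1.0, 2.0, 3.0, 4.0]
p_cal = [1.1, 1.9, 3.2, 3.8]
print(calc_vs_exp_stats(p_exp, p_cal))
```

A slope close to one and an intercept close to zero, as in Eqs. 4 and 5, indicate that the calculated values track the experimental ones without a large proportional or constant bias.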

Table 3 compares the results obtained using the GA-MLR and GA-ANN models. The r² and RMSE of the models for the total, training, validation, and prediction sets show the potential of the ANN model for prediction of pIG₅₀ values of phenolic compounds using a global QSAR model. A properly selected and trained neural network could thus fairly represent the dependence of the toxicity of phenols on the descriptors, and the optimized network could simulate the complicated nonlinear relationship between the pIG₅₀ value and the descriptors. The RMSE of 0.634 for the prediction set with the GA-MLR model should be compared with the value of 0.224 for the GA-ANN model. The ability of the proposed model to predict pIG₅₀ is also better than that of the recently proposed QSAR models [26]. It can be seen from Table 3 that, although the descriptors appearing in the GA-MLR model are used as inputs for the GA-ANN model, the statistics show substantial improvement. This improvement is attributed to the non-linear correlation of the toxicity of phenols to Tetrahymena pyriformis with the selected descriptors.

Data and methodology

The data set of toxicity values (pIG₅₀, or log(1/IGC₅₀)) for the 250 phenolic compounds used for the QSAR models was taken from the literature [1]. The data set was randomly split into training, validation, and prediction sets (150, 50, and 50 compounds, respectively; Table 2). The z-matrices (molecular models) were constructed with HyperChem 7.0, and the molecular structures were optimized using the AM1 semi-empirical method [29]. The theoretical descriptors were calculated with the Dragon package, version 2.1 [30]. For this purpose the output of the HyperChem software for each compound was fed into the Dragon program. As a result, a total of 1,481 theoretical descriptors were calculated for each of the 250 compounds in the data set.

The theoretical descriptors were reduced by the following procedure:

  1. descriptors that were constant were eliminated (394 descriptors); and

  2. to reduce the redundancy existing in the descriptors, the correlation of the descriptors with each other and with the pIG₅₀ of the molecules was examined, and collinear descriptors (R > 0.9) were detected. Among the collinear descriptors, that with the highest correlation with toxicity values was retained, and the others were removed from the data matrix (703 descriptors).
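The two reduction steps above can be sketched as follows. This is a minimal illustration, not the procedure's original code: the function names are hypothetical, the greedy ordering used to resolve groups of collinear descriptors is an assumption (the text does not say how groups of more than two are handled), and the toy descriptor values are invented.

```python
from math import sqrt
from typing import Dict, List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def reduce_descriptors(cols: Dict[str, List[float]], y: List[float],
                       cutoff: float = 0.9) -> List[str]:
    """Step 1: drop constant descriptors.
    Step 2: among collinear descriptors (|R| > cutoff), keep the one
    most strongly correlated with the target y."""
    names = [k for k, v in cols.items() if max(v) > min(v)]
    # Visit descriptors from most to least correlated with y, so the member
    # retained from each collinear group is the best-correlated one.
    names.sort(key=lambda k: abs(pearson(cols[k], y)), reverse=True)
    kept: List[str] = []
    for k in names:
        if all(abs(pearson(cols[k], cols[j])) <= cutoff for j in kept):
            kept.append(k)
    return kept


# Toy data (hypothetical): "c" is constant, "a" and "b" are collinear.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
cols = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.2],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [1.0, 1.0, 1.0, 1.0, 1.0],
    "d": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(reduce_descriptors(cols, y))  # → ['b', 'd']
```

Descriptor "b" is retained in preference to "a" because it correlates more strongly with the target, and "d" survives because it is not collinear with "b".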

The genetic algorithm (GA)

To select the most relevant descriptors, evolution of the population was simulated [31–35]. Each individual of the population, defined by a chromosome of binary values, represented a subset of descriptors. The number of genes on each chromosome was equal to the number of descriptors. The population of the first generation was selected randomly. A gene took a value of 1 if its corresponding descriptor was included in the subset; otherwise, it took a value of 0. The number of genes with a value of 1 was kept relatively low to furnish a small subset of descriptors [35]; that is, the probability of generating 0 for a gene was set greater (at least 60%) than that of generating 1. The operators used here were crossover and mutation. The probability of application of these operators was varied linearly with generation renewal (0–0.1% for mutation and 60–90% for crossover). The population size was varied between 50 and 250 for different GA runs. For a typical run, the evolution of the generation was stopped when 90% of the generations took the same fitness [21]. The GA program was written in Matlab 6.5 [36].
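The binary-chromosome scheme described above can be sketched as follows. This is a simplified illustration, not a translation of the Matlab program: the one-point crossover, the truncation selection, the fixed number of generations used in place of the convergence criterion, the assumption that both operator probabilities increase over the run, and the toy fitness function are all assumptions made for the example.

```python
import random
from typing import Callable, List

Chromosome = List[int]


def random_chromosome(n_genes: int, p_zero: float, rng: random.Random) -> Chromosome:
    """Gene = 1 selects a descriptor; p_zero >= 0.6 keeps the subsets small."""
    return [0 if rng.random() < p_zero else 1 for _ in range(n_genes)]


def crossover(a: Chromosome, b: Chromosome, rng: random.Random) -> Chromosome:
    """One-point crossover (the crossover type is an assumption)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(c: Chromosome, p_mut: float, rng: random.Random) -> Chromosome:
    """Flip each gene independently with probability p_mut."""
    return [1 - g if rng.random() < p_mut else g for g in c]


def run_ga(n_genes: int, fitness: Callable[[Chromosome], float],
           pop_size: int = 50, generations: int = 100, seed: int = 0) -> Chromosome:
    rng = random.Random(seed)
    pop = [random_chromosome(n_genes, 0.6, rng) for _ in range(pop_size)]
    for gen in range(generations):
        t = gen / max(generations - 1, 1)
        p_cross = 0.60 + 0.30 * t          # 60% -> 90%, varied linearly
        p_mut = 0.001 * t                  # 0% -> 0.1%, varied linearly
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]     # truncation selection (assumption)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng) if rng.random() < p_cross else a[:]
            children.append(mutate(child, p_mut, rng))
        pop = parents + children
    return max(pop, key=fitness)


# Toy fitness (hypothetical): reward descriptors 0 and 3, penalize subset size.
def toy_fitness(c: Chromosome) -> float:
    return c[0] + c[3] - 0.1 * sum(c)


best = run_ga(n_genes=8, fitness=toy_fitness, pop_size=20, generations=30, seed=1)
print(best, toy_fitness(best))
```

In a descriptor-selection setting the fitness function would score the regression model built from the descriptors whose genes are 1; the toy function above only stands in for such a criterion.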

The artificial neural network (ANN)

A feed-forward artificial neural network with a back-propagation (BP) of error algorithm was used to process the non-linear relationship between the selected descriptors and the toxicity (pIG₅₀). The number of input nodes in the ANN was equal to the number of descriptors appearing in the MLR model. The ANN model was confined to a single hidden layer, because a network with more than one hidden layer would be harder to train. A three-layer network with a sigmoidal transfer function was designed. The initial weights were randomly selected between 0 and 1. Optimization of the weights and biases was carried out with the Levenberg–Marquardt algorithm for BP of error, which, although requiring far more computer memory, is significantly faster than other algorithms based on gradient descent [37]. The data set was randomly divided into three groups: a training set, a validation set, and a prediction set consisting of 150, 50, and 50 molecules, respectively. The training and validation sets were used for generation of the model, and the prediction set was used for evaluation of the generated model. The performance of the models on the training, validation, and prediction sets was evaluated by the root mean square error (RMSE), which is defined by Eq. 6.

$$ {\text{RMSE}} = \sqrt {\sum\limits_{i = 1}^{N} {{\frac{{(P_{i}^{ \exp } - P_{i}^{\text{cal}} )^{2} }}{N}}} } $$
(6)

where Pᵢ(exp) and Pᵢ(cal) are the experimental values of pIG₅₀ and the values calculated with the models, respectively, and N denotes the number of data points. The residual is defined by Eq. 7.

$$ {\text{Residual}} = P_{i}^{ \exp } - P_{i}^{\text{cal}} . $$
(7)
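Equations 6 and 7 translate directly into code. The sketch below uses hypothetical function names and toy values; it is included only to make the definitions concrete.

```python
from math import sqrt
from typing import List, Sequence


def residuals(p_exp: Sequence[float], p_cal: Sequence[float]) -> List[float]:
    """Residual for each compound (Eq. 7): experimental minus calculated."""
    return [e - c for e, c in zip(p_exp, p_cal)]


def rmse(p_exp: Sequence[float], p_cal: Sequence[float]) -> float:
    """Root mean square error (Eq. 6) over N data points."""
    res = residuals(p_exp, p_cal)
    return sqrt(sum(r * r for r in res) / len(res))


# Toy values (hypothetical, not taken from Table 2).
p_exp = [1.0, 2.0, 3.0]
p_cal = [1.5, 2.0, 2.5]
print(residuals(p_exp, p_cal))  # → [-0.5, 0.0, 0.5]
print(rmse(p_exp, p_cal))
```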

The processing of the data was carried out using Matlab 6.5 [38]. The neural networks were implemented using Neural Network Toolbox Ver. 4.0 for Matlab [39].
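For orientation, the forward pass of a three-layer 6-15-1 network of the kind described in this section can be sketched as follows. This is not the Matlab Neural Network Toolbox implementation: the linear output node, the drawing of biases from the same [0, 1) range as the weights, and the input vector are assumptions, and training (Levenberg–Marquardt back-propagation) is not shown.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Tuple[Matrix, List[float]]:
    """Initial weights (and, by assumption, biases) drawn uniformly from [0, 1)."""
    W = [[rng.random() for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.random() for _ in range(n_out)]
    return W, b


def forward(x: List[float], W1: Matrix, b1: List[float],
            W2: Matrix, b2: List[float]) -> float:
    """Sigmoidal hidden layer followed by a single linear output (assumption)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2[0], hidden)) + b2[0]


rng = random.Random(0)
W1, b1 = init_layer(6, 15, rng)    # six descriptors -> fifteen hidden nodes
W2, b2 = init_layer(15, 1, rng)    # fifteen hidden nodes -> one pIG50 output

x = [0.2, -0.1, 0.5, 0.0, 1.0, 0.3]   # hypothetical scaled descriptor values
print(forward(x, W1, b1, W2, b2))      # output of the untrained network
```

Training then adjusts `W1`, `b1`, `W2`, and `b2` to minimize the RMSE of Eq. 6 on the training set while the validation set is monitored for overtraining.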

Conclusion

In this study, linear (GA-MLR) and nonlinear (GA-ANN) global QSAR models were used to construct quantitative relationships between the toxicity of phenols to Tetrahymena pyriformis and their calculated descriptors. Comparison of the results obtained with the GA-ANN and GA-MLR models confirmed the superiority of the GA-ANN model as a more powerful method for predicting pIG₅₀. A suitable model with high statistical quality and low prediction errors was derived. Because the improvement obtained with the non-linear model (GA-ANN) is substantial, it can be concluded that there is a non-linear correlation between the descriptors and the pIG₅₀ values of the phenols.