An efficient piecewise linear model for predicting activity of caspase-3 inhibitors

Background and purpose of the study Multimodal distribution of descriptors makes it more difficult to fit a single global model to model the entire data set in quantitative structure activity relationship (QSAR) studies. Methods The linear (Multiple linear regression; MLR), non-linear (Artificial neural network; ANN), and an approach based on “Extended Classifier System in Function approximation” (XCSF) were applied herein to model the biological activity of 658 caspase-3 inhibitors. Results Various kinds of molecular descriptors were calculated to represent the molecular structures of the compounds. The original data set was partitioned into the training and test sets by the K-means classification method. Prediction error on the test data set indicated that the XCSF as a local model estimates caspase-3 inhibition activity, better than the global models such as MLR and ANN. The atom-centered fragment type CR2X2, electronegativity, polarizability, and atomic radius and also the lipophilicity of the molecule, were the main independent factors contributing to the caspase-3 inhibition activity. Conclusions The results of this study may be exploited for further design of novel caspase-3 inhibitors.


Introduction
There are some challenges such as new disease and drug resistance which our society is faced with. On the other hand, drug discovery is a costly and time consuming process. In this context, there is a great demand for predictive models to design new drugs with improved properties and diminished side effects [1]. Furthermore, there is also a demand for new methods that replace and reduce the use of laboratory animals [2]. These methods should be used in the design and evaluation of experimental tests and in the selection of appropriate test compounds for validation studies [3].
In order to reduce the time and cost during the drug development process, quantitative structure activity relationship (QSAR) is a robust scientific method which can help scientists to predict the activity and side effects of new compounds [4,5]. Distribution of training data affects on the model that is suitable for it, so searching to find a convenient method is an essential part of QSAR model evaluation to predict the biological activity for new targets and new compounds.
Herein, we described a linear (Multiple linear regression, MLR) and one non-linear (Artificial neural network, ANN) as global models and Extended Classifier System in Function approximation (XCSF) as a local model to develop a convenient model for predicting caspase-3 inhibition activity [6]. Caspases are cysteine-aspartic specific proteases involved in the signaling cascades of programmed cell death. Caspase-3 is a key executioner member of the caspase family which propagates death signals from intrinsic and extrinsic stimuli to downstream targets. Inappropriate control of caspases (especially caspase-3) as apoptosis machinery has been implicated in many diseases, including neurodegenerative disorders, cancer, and autoimmune diseases [7,8]. In this study, we tried to develop and choose a robust model to predict caspase-3 inhibition activity as a scientific method to facilitate the discovery of anti-neurodegenerative agents.

Biological data and descriptors generation
Structures of caspase-3 inhibitors and corresponding biological activities were downloaded from Binding database in MDL, in SD file format, which were then extracted and converted to corresponding Mol format files [9]. In total 658 structures with experimentally tested IC 50 were used as input data in this study. After geometry optimization of the molecular structures, the Dragon software (version 2.1) was used to calculate descriptors. In total 1481 descriptors (variables) were calculated by this procedure.

Feature selection and linear regression
In order to reduce the number of descriptors, a feature selection algorithm was carried out in four steps. In the first step, variables with more than 80% constant values were removed. For the second step, variables with less than 0.7 correlations with biological activity (caspase-3 inhibition activity) were omitted.
In the third step, descriptors which showed 0.7 cross correlation with each other were removed from variables. After these three steps, 99 variables remained in the data set, but they were more than what was required for QSAR models. For solving this problem, multivariate linear regression was implemented and stepwise selection method was used as the fourth step. 24 descriptors remained after the fourth step and were used for the linear model (24 remained descriptors are shown in Table 1). Mean square error of linear regression model on the training set is 0.262.
For the non-linear models the 24 selected variables, which have been selected by linear regression, are too many, so in the next step principal component analysis (PCA) on 24 selected variables was applied. After this step, seven variables were selected (See Table 2).
As shown in Table 2, after PCA, one descriptor was selected among each category of descriptors and some categories were eliminated completely. Among selected descriptors, 3D ones have more contribution rather than 1D and 2D ones, which confirms the role of spatial restrictions of caspase-3 receptor to pick up its inhibitors.

Extended classifier system in function approximation
XCSF is a piecewise linear function approximation system which approximates function values in the continuous space [1,11]. There are some differences between XCSF and XCS representation of conditions, classifier prediction mechanism, update process are some essential difference points between XCSF and XCS [12].
In XCSF, each classifier consists of a condition, an action, and some other parameters. The condition covers input states that the classifier matches with them. The action specifies the action for which the payoff is predicted and it is executed on environment. XCSF as a pure function approximator has a dummy action which has no actual effect on the environment. The main XCSF parameters are: the weight vector ; w; that is used to  compute the classifier prediction of input state; the prediction error , ε, which estimates the error affecting classifier prediction; the numerosity, num, used to determine number of different copies of the same classifier. The weight vector ; w; which has n + 1 components in n dimensional state space (Each w i is corresponding to a special feature of input space and w 0 is corresponding to a constant input, x 0 , which is set as a parameter of XCSF). Performance component of XCSF works as XCS. At each time step XCSF builds a match set, [M], containing the classifiers in the classifier list or population, [P], whose condition matches with current input state; if [M] contains less than θ mna classifiers, covering process occurs as in XCSI 12 ; making a classifier that matches with current state and inserting it to [M]. In the covering process, the weight vectors of classifiers are initialized with zero values; all the other parameters are initialized as in XCS [13]. In XCSF as a pure function approximator, prediction is computed by the fitnessweighted average of all matching classifiers: Where s t is current sensory input, cl is a classifier, [M] represents match set, cl.F is the fitness of cl classifier, and cl.p(s t ) is the prediction of cl in state s t which is computed as: Where s t is current sensory input, cl.w i is the weight w i of clthat is used to compute the classifier, and i is related to i'th dimension of input state. XCSF uses the reward value P (actual function value for current input) to update the parameters of classifiers in match set. The weight vector w of the classifiers in [M] are updated using a modified delta rule: Where η is the correction rate, |s t (i)| 2 is the norm input vector s t [1]. The weights of classifier cl are updated using Δw i values as: Then the prediction error ε is updated as: cl:ε ¼ cl:ε þ β P À cl:p s t ð Þ j jÀ cl:ε ð Þ ð 5Þ Where cl.ε is the prediction error of classifier cl, β is learning rate, and P is the reward value. Classifier fitness is updated similar to XCS.
The genetic algorithm in XCSF works as in XCSI 1 . Genetic algorithm (GA) is applied to improve the rule set of XCSF by generating new classifiers which contribute to existing knowledge and removing classifiers which do not offer any improved contributions. In function approximation, the genetic algorithm (GA) is applied to the classifiers of match set [M]. Firing of GA component is directly depending on θ ga . Two classifiers are selected as parents with the probability proportional to their fitness. Crossover performed with a fixed probability p c on copies of individuals and with probability p m mutation changes their allele. Before inserting off springs to the population set, in order to keep a fixed population size, two classifiers may be deleted. For a sufficient experienced and accurate classifier deletion probability is proportional to its set size and fitness. Hence, if an experienced classifier has lower fitness rather than average fitness of classifiers in population set, its deletion probability is increased [11,13]. So, generation of maximally general conditions that efficiently cover the feature space is performed by GA progress.

Artificial neural network
To examine the ability of 7 selected features in predicting activity values of inhibitors, selected variables using feature selection filters are fed into input layer of ANN. A three-layer neural network with 7-X-1 structure is used in this study. ANN parameters were optimized according to trial-and-error procedure. Data set were divided to training, validation, and test subsets. Validation set results directed us to find optimal setting for ANN. To access the performance of fully-trained model, test samples are evaluated and after evaluating the final model by using the test set, the model parameters should not tune further.

Results and discussion
The proper selection of a training set is one of the most basic operations in quantitative structure activity relationship studies. Small, relevant, and homogeneous data sets have and continue to be the workhorse for structure-activity predictions when the activity for a new analogue is needed for a particular chemical series. For large data sets, however, the selection of a training set is critical since compounds of diverse chemical structure are contained within the chemical space of the database.
To remove the dependency between the training and testing samples, 10-fold cross validation is performed [14]. The original samples are randomly partitioned into 10 subsets. Of the 10 subsamples, a single subsample is retained for testing the model, and the remaining 9 subsamples are used in the training process. The crossvalidation process is repeated 10 times, so each of the 10 subsamples used exactly once as the validation data. All observations are used for both training and validation sets, and each observation is used for validation exactly once.
In this study, in a preprocessing phase, training data are partitioned into multiple clusters (by the K-means algorithm) [14]. Optimal number of clusters is determined by the Silhouette method [15]. Average distance between each training sample and center of its corresponding cluster is considered as stretch variation in condition creation. The achieved results using XCSF classifier is computed by averaging among 20 experiments. The XCSF parameters are set as N = 656, maxPopSize = 30, functionSize = 7, cov-erConditionRange in this problem is approximately 0.17, realConditionType is hyperEllipsoidalCondition, ellipsoi-dalConditionType is axis-parallel ellipsoid, type of prediction is linear prediction with RLS update and other parameters are set according to reference [16].
Mean square error (MSE) of each method has been calculated for all 10 subsets (See Table 3). The plot of the mean square errors versus the 10 subsets is shown in Figure 1. As depicted, the XCSF models are superior to both of the MLR and ANN models. Mean of these errors has been used to calculate P-values (See Table 4). Table 4 shows that the differences between XCSF and ANN and also between XCSF and MLR are significant. According to Tables 3 and 4, the XCSF model is an appropriate model with accepted differences from the with MLR and ANN models.

Descriptors description
Besides demonstrating statistical significance, QSAR models should also provide useful chemical insights for drug design. For this reason, an acceptable interpretation of the QSAR results is provided below. By interpreting the descriptors contained in the model, it is possible to gain some insight into the factors which are related to the caspase-3 inhibition activity.
C-012 is the first descriptor, appearing in the model. It is one of the atom-centered fragment descriptors that describes each atom by its own atom type, the bond types and atom types of its first neighbors. The C-012 descriptor displays CR 2 X 2 . This atom-centered fragment descriptor can be defined as a central carbon atom (C) that has two carbon neighbors (R 2 ) and two heteroatom neighbors (X 2 ).
The second descriptor is MLOGP which is one of the molecular properties descriptors. The molecular descriptor MLOGP refers to the Moriguchi log of the octanol/ water partition coefficient of the molecule and is considered as a measure of lipophilicity of a molecule.
The next descriptor is GATS6e (Geary autocorrelationlag 6 / weighted by atomic Sanderson electronegativities), which is one of the 2D autocorrelations descriptors. In this descriptor the Geary coefficient is a distance-type function, that function is any physico-chemical property calculated for each atom of the molecule, such as atomic mass, polarizability, etc. Therefore, the molecule atoms represent the set of discrete points in space and the atomic property the function evaluated at those points. The physico-chemical property in this case is atomic Sanderson electronegativities.
The 2D Petitjean shape index (PJI2), is one of the topological descriptors. This descriptor also called graphtheoretical shape coefficient, is proposed to describe the topological anisometric. This molecular shape descriptor describes the degree of deviation from a perfect cyclic topology.
Mor12p is the next descriptor, appearing in the model. It is one of the 3D-MoRSE descriptors. 3D MoRSE descriptors (3D Molecule Representation of Structures based on Electron diffraction) are derived from Infrared spectra simulation using a generalized scattering function. This descriptor was proposed as signal 12 / weighted by atomic polarizability which relates to polarizability of the molecules.
The sixth descriptor of the model was the R autocorrelation of lag 4 / weighted by atomic Sanderson electronegativities (R4e). It belongs to the GETAWAY descriptors. This descriptor is related to the electronegativity, the size and the location of the atom in the molecule. The greater the electronegativity, atomic radius and the distance between the atom and the center of the molecule are, the greater the descriptor value is.
The final descriptor is RDF070m (Radial Distribution Function -7.0 / weighted by atomic masses), which is one of the radial distribution function (RDF) descriptors. RDF in this form meets all the requirements for the 3D structure descriptors. It is independent of the atom number (i. e. the size of a molecule), and is unique regarding the three-dimensional arrangement of the atoms and is also invariant against the translation and rotation of the entire molecule. Additionally, the RDF descriptors can be restricted to specific atom types or distance ranges to represent specific information in a certain three-dimensional structure space. In summary, it is concluded that; the atom centered fragment type CR 2 X 2 , electronegativity, polarizability, size of atoms and the lipophilicity of the molecule play a main role in the caspase-3 inhibition activity of the studied compounds.

Conclusion
In this study we compare three global and local QSAR modeling techniques in caspase-3 activity prediction. Kmeans and Silhouette method show that data set is distributed in different locality of feature space. Linear regression as a global method without using the information of different clusters that are scattered in the data set tunes a single hyper plane to predict the activity function. Applying global nonlinear methods such as neural networks can be more useful than global linear methods. Neural networks try to match its prediction models to entire data set precisely so that may cause over training error. Multimodal distribution of features generated from caspace-3 data set causes to apply a local model which tunes a modeling system on each cluster of data set. XCSF as a piecewise linear system in function approximation problems is used for prediction of activity value corresponding to each sample. As expected, results show that global nonlinear method predicts activities more accurate than global linear regression method and a local model that regards the locality information is the most accurate. Results are presented for caspase-3 data indicating that XCSF as a local modeling strategy is preferred over a global strategies such as linear regression and ANN.