A study on quantitative structure–activity relationship and molecular docking of metalloproteinase inhibitors based on L-tyrosine scaffold

Background MMP-2 enzyme is a kind of matrix metalloproteinases that digests the denatured collagens and gelatins. It is highly involved in the process of tumor invasion and has been considered as a promising target for cancer therapy. The structural requirements of an MMP-2 inhibitor are: (1) a functional group that binds the zinc ion, and (2) a functional group which interacts with the enzyme backbone and the side chains which undergo effective interactions with the enzyme subsites. Methods In the present study, a QSAR model was generated to screen new inhibitors of MMP-2 based on L–hydroxy tyrosine scaffold. Descriptors generation were done by Hyperchem 8, DRAGON and Gaussian98W programs. SPSS and MATLAB programs have been used for multiple linear regression (MLR) and genetic algorithm partial least squares (GA-PLS) analyses and for theoretical validation. Applicability domain of the model was performed to screen new compounds. The binding site potential of all inhibitors was verified by structure-based docking according to their binding energy and then the best inhibitors were selected. Results The best QSAR models in MLR and GA-PLS were reported, with the square correlation coefficient for leave-one-out cross-validation (Q2LOO) larger than 0.921 and 0.900 respectively. The created MLR and GA-PLS models indicated the importance of molecular size, degree of branching, flexibility, shape, three-dimensional coordination of different atoms in a molecule in inhibitory activities against MMP-2. The docking study indicated that lipophilic and hydrogen bonding interactions among the inhibitors and the receptor are involved in a ligand-receptor interaction. The oxygen of carbonyl and sulfonyl groups is important for hydrogen bonds of ligand with Leu82 and Ala83. R2 and R3 substituents play a main role in hydrogen bonding interactions. R1 is sited in the hydrophobic pocket. Methylene group can help a ligand to be fitted in the lipophilic pocket, so two methylene groups are better than one. The Phenyl group can create a π-π interaction with Phe86. Conclusions The QSAR and docking analyses demonstrated to be helpful tools in the prediction of anti-cancer activities and a guide to the synthesis of new metalloproteinase inhibitors based on L-tyrosine scaffold.


Introduction
The matrix metalloproteinases (MMPs) function predominantly as enzymes that degrade structural components of the extracellular matrix (ECM) [1][2][3][4]. MMPs play a substantial role in tumor progression and invasion of inflammatory cells. Among MMPs, MMP-2 easily digests the denatured collagens and gelatins [5,6]. It is highly involved in the process of tumor invasion and has been considered as a promising target for cancer therapy [3,7,8]. MMP-2 has a catalytic center) zinc (II) ion (and two hydrophobic domains (S1´pocket and S1 pocket). S1´pocket, the key domain of MMP-2, is deeper and narrower than other MMP subtypes and S1 pocket is a solvent exposure domain [3,9,10].
The structural requirements of an MMP-2 inhibitor are: (1) a functional group that binds the zinc ion (zincbinding group; ZBG), capable of chelating the active site zinc ion; (2) a functional group which interacts with the backbone of enzyme; (3) side chains that undergo effective interactions with the enzyme subsites, such as S1ṕ ocket and S1 pocket [3,11,12]. Cheng et al. studied the L-hydroxy proline scaffoldbased MMP-2 inhibitors in 2008 [13], and, in order to identify more potent MMP-2 inhibitors, replaced Lhydroxy proline with the L-tyrosine scaffold to form a new integrated structural pattern. They reported that the alteration in substitution pattern at R 1 , R 2 and R 3 positions alter MMP-2 inhibitory activity [1].
In 2012, 30 L-hydroxy tyrosine scaffold-based MMP-2 inhibitors were identified. It seems that finding a relationship between the structure of these compounds and their inhibitory activities in order to design structures with better activities and to predict their activity would be essential.
Quantitative structure activity relationships (QSARs), one of the most extensively used methods in chemometrics, and molecular docking are two of the helpful methods for drug design and prediction of drug activity [14][15][16]. QSAR models are mathematical equations which generate a communication between chemical structures and their biological activities, while molecular docking is done to specify the structural features that are important for interaction with a receptor.
In this report, we have performed a QSAR study and a molecular docking examination on 30 compounds of Ltyrosine derivatives which had been synthesized and evaluated as metalloproteinase (MMP-2) inhibitors [1].

QSAR
All calculations were implemented using an Intel Core i5 2.4 GHz processor, with the windows 7 operating system. Geometry optimization was done by Hyperchem 8.0 software. Descriptor generation was performed by Hyperchem 8.0, DRAGON package and Gaussian 98 W programs. SPSS software (version 11.5) and MATLAB software (version 7.12.0) were used for model creation and validation methods.

Activity data and descriptors generation
In this study, the biological data employed is MMP-2 inhibitory activity of 30 compounds. The synthesis and determination of biological activity of these inhibitors have been reported by Cheng et al. [1]. The structure of these compounds and their biological activity are shown in Table 1. The two-dimensional structures of molecules were drawn using Hyperchem 8.0 software. At the beginning, pre-optimization was conducted using the MM + molecular mechanic force field and then a more accurate optimization was performed with the semiempirical PM3 method. The optimization was performed using the Polak-Ribiere algorithm until the root mean square gradient reached 0.01 kcal/ (Å mol). Hyperchem 8.0 program was also used to calculate chemical descriptors including: surface area, molecular volume, hydration energy, octanol/water partition coefficient (logP), molar refractivity, molar polarisability and molar mass.
The output files of Hyperchem 8.0 software (.hin files) were transferred to DRAGON package to calculate four classes of descriptors (0D, 1D, 2D and 3D) including 28 constitutional descriptors, 10  The z-matrix files of compounds were also provided by Hyperchem 8.0 program (.zmt files) and then were transferred as input files to the Gaussian98W program [17]. Semi-empirical molecular orbital calculation by PM3 method was performed using Gaussian98W program. Different quantum chemical descriptors were obtained by this method including highest occupied molecular orbital energy (E HOMO ), lowest unoccupied molecular orbital energy (E LUMO ), and molecular dipole moment. Local charge was obtained by PM3 method in Gaussian98W. Hardness (η = 0.5 (E HOMO + E LUMO )), softness (S = 1/η), electronegativity (χ = 0.5 (E HOMO -E LUMO )) and electrophilicity (ω = χ 2 /2η) were calculated according to the method proposed by Thanikaivelan et al. [18].

Data processing and models building
All calculated descriptors were collected in a data matrix, D, the number on rows was representative of molecule numbers and the numbers on columns accounted for descriptors (30 × 1168). At the beginning, the columns which had constant and near constant values were removed from the original data matrix. Since collinear variables disrupt the models in MLR analysis, collinear descriptors needed to be detected and removed. The correlation of descriptors with one another and with activity data was then investigated. In the pairs with collinearity higher than 0.9, one which had the highest correlation versus the activity was retained and the rest were omitted. The number of total descriptors for each molecule reached 291. The data set (30 compounds) was split into a calibration set and validation set. Validation subset was made of 20% of the total data (here, 6 biological activity data). An MLR analysis was performed by the stepwise regression SPSS (version 12.0) software and the model was built.
Since the data splitting has a considerable influence on the final selected model, a combined data splitting-feature selection (CDFS) strategy was employed [19]. In the CDFS methodology, several subdivisions of calibration and validation set were made (10 times). In each case, the best model was chosen with a correlation coefficient higher than 0.95. The created models were validated by leave-one out (LOO) cross-validation method and Y-randomization test to investigate their predictability.

Molecular docking
Molecular docking has become an increasingly main tool for drug discovery. Docking is a method which predicts the preferred orientation of one ligand when bound in an active site to form a stable complex. In this study, molecular docking of L-tyrosine derivatives with MMP-2 was studied by AutoDock 4.2 program [20] to find their binding site and the best direction based on the binding energy. Pdb file of MMP-2 was obtained from www.pdb.org (PDB: 3AYU) [21]. All water molecules were deleted, polar hydrogens were added and the Kollman charge [22] was computed. A 120 x 120 x 120 Å size was chosen for the grid box, which covers the whole protein. In all of the ligands, the same as protein, polar hydrogens were added and the Gasteiger charge was computed [23]. After preparation of ligand and protein files, the map files were created. Docking process with 50 runs and maximum number of evaluations 2500000 were performed. The final .dlg files were analyzed and the interaction between ligands and the active site of protein were studied. The ligand-protein interactions were analyzed and visualized by Discovery studio Ver. 3.

Multiple linear regression analysis (MLR)
This study made use of an MLR analysis, as a simple regression method. The stepwise regression (using SPSS software) was also utilized to choose the most relevant set of descriptors for each type of the split data. The model coefficients were calculated using calibration data and then were used to predict the biological activity of validation samples. Several models were constructed by running a typical stepwise regression which is ranked based on calibration correlation coefficient (R 2 c). A model with calibration correlation coefficient higher than 0.9 was selected. The created model needed to be validated. The theoretical validation is generally classified into two groups: internal and external validation. Two of the internal validation methods include Leave-one out cross validation (LOO) and Y-randomization [24]. The offered model was evaluated for both over-fitting and avoidance of chance correlations by the leave-one-out cross-validation (LOO) method and Y-randomization test respectively.
MLR method was repeated several times by different splitting data. In each case, one model was proposed. Leave-one-out cross-validation was performed and lastly, the best ten models were obtained with R 2 and Q 2 LOO higher than 0.9 in as reported in Table 2. The statistical qualities of models were acceptable and all of them had Q 2 LOO larger than 0.92; hence, the predicted models can make over 92% of variances in the inhibitory activity. In addition, results were acceptable in the prediction sets. The values of prediction correlation coefficient (R 2 p ) are listed in Table 2 for the ten final models. All of the squared correlation coefficients were higher than 0.90, so the resultant linear models can predict 90% of variances in the inhibitory activity in the ten prediction sets. The total number of descriptors, which existed in all the ten models, was 22. These descriptors are briefly described in Table 3. Among these, 3 descriptors were common. Some of the descriptors such as E3u, HATS7e, RDF065m, SP20, RDF115e, G3m and dipx were observed only in one model. The repeated descriptors were IC1, Mor24m and Mor15e. IC1 descriptor existed in all the ten models. IC1 is topological descriptor and Mor24m and Mor15e are 3D-MoRSE ones.
The repetition of IC1 in all the ten models indicated that this descriptor has a main effect on L-tyrosine scaffoldbased MMP-2 inhibitors. 3D-MoRSE descriptors also have significant effects. The quantum descriptors such as (dipx) that were seen only in one model have lower effects on MMP-2 inhibitors. The observation of the models, as listed in Table 2, revealed that there is a high degree of resemblance between the subset of descriptors (without considering the difference between the chosen descriptors in different models).
To create a general model, all of the 22 descriptors were used and an MLR analysis was applied with the stepwise variable selections. Finally, one model with balance between the highest R 2 , Q 2 LOO and the lowest number of descriptors was opted for further analysis, as reported in MLR Equation In this equation, N represents the number of molecules in the calibration set. R 2 c and Q 2 LOO are respectively the squared correlation coefficients for calibration and cross-validation. In addition, S.E is the standard error of calibration and RMS CV is the root mean square error of cross-validation. This equation has three common descriptors (IC1, Mor24m and Mor15e), with high calibration statistics and prediction ability.
Topological descriptors, such as IC1, explain the size, degree of branching, flexibility and overall shape. 3D-MoRSE descriptors explain three-dimensional coordination of the different atoms in a molecule. IC1, Mor24m and Mor15e have positive signs which indicate that pIC 50 is directly related to these descriptors. The radial distribution function (RDF) descriptors are based on the distance distribution in the molecule. RDF115m has a negative sign which indicates that pIC 50 is inversely related to this descriptor.
For the MMP-2 inhibition activity, a higher value of molecular size, degree of branching, flexibility, shape, three-dimensional coordination of the different atoms in a molecule (IC1) and a lower value of radial distribution function, RDF115m, are beneficial to the activity.
The general model has a Q 2 LOO equal to 0.921; hence, the predicted model can make over 92% of variances in the inhibitory activity. The predicted values of pIC 50 were obtained for all the molecules by MLR equation which was listed in Table 1 and were plotted against the experimental values ( Figure 1A). The Y-randomization test was performed to evaluate the robustness of the general model. In the Y-randomization test, the activity data were randomly permuted in the original model and 10 models were generated. All of the models were expected to have lower R 2 and Q 2 LOO values than the original MLR model. If the reverse occurs, a suitable MLR model cannot be generated. The lower R 2 and Q 2 LOO values are shown in Table 4.

Genetic algorithm partial least squares (GA-PLS)
Since the variable selection method, that is the stepwise regression, may not be a suitable selection, variable selection methods such as the genetic algorithm, which demonstrate much better outcomes in comparison with the stepwise regression, are more favorable [25,26]. In genetic algorithm, if its corresponding descriptor is contained, a gene receives a value of 1 in the subset; otherwise, it takes a value of zero. The number of genes at each chromosome is equivalent to the number of descriptors. The population size varied between 80 and 125 for different GA runs. The number of genes with the values of 1 was relatively lower than the number of genes with the values of 0. The genes with the values of 1 were maintained. The chromosomes with the smallest numbers of chosen descriptors (total number of descriptors for each molecule reached 105) and the highest fitness were selected as the intended model. The predicted model was tested by leave-n-out cross-validation [27]. A leave-one-out cross-validation was triggered and the value of Q 2 LOO was obtained 0.850. In GA-PLS, the resulted model with higher crossvalidation statistics is reported in Equation 2 and the predicted values of pIC 50 are shown in Table 1 and plotted against the experimental values in Figure 1B The Y-randomization test was performed to evaluate the robustness of the created model in GA-PLS and 10 models were generated. All of the models were expected to have lower R 2 and Q 2 LOO values than the original GA-PLS model, as shown in Table 4.

Applicability domain of the model
A QSAR model is exploited to screen new compounds when its domain of application has been defined [28]. The prediction may be assumed reliable for only those  compounds which fall into this domain [29]. Standardized residuals of the activity were computed and were plotted versus leverage values (h). The value of leverage was calculated for every compound. Values are always between 0 and 1. A value of 0 is indicative of perfect prediction and usually is not accessible, and a value of 1 indicates very poor prediction. The lower the value, the higher confidence in the prediction. Warning leverage (h*) is another standard for explanation of the results and is, generally, fixed at 3 (k + 1)/ n, where k is the number of model parameters and n is the number of calibration set [29]. Calculated leverage for calibration set is useful for determining the compounds which affect the model and, in terms of validation set, useful for assigning the applicability domain of the model. The William's plot for the developed models in MLR and GA-PLS are shown in Figure 2.
Response outliers are compounds that have standard residual points higher than ± 2.0 standard deviation units and a leverage value higher than the warning leverage, which is 0.75 and 0.5 for MLR and GA-PLS respectively. As can be seen in Figure 2, all studied molecules in calibration and validation set lie with high degree of confidence in application domain of the developed models in both methods.

Molecular docking studies
To explore the ligand binding modes, and to find amino acids, which are essential in ligand binding to MMP-2, molecular docking was performed on a ligand-binding pocket. The way the compound was bound with the lowest free energy was studied. Interactions between MMP-2 and ligand N-4-[(1-hydroxycarbamoyl-2-methyl-propyl)-(2-morpholin-4-yl-ethyl)-sulfamoyl]-4-pentyl-benzamide (SC-74020) were obtained by Feng et al. and were reported. According to their results, the catalytic zinc is chelated by His120, His124, His130 and ligand, and the structural zinc is coordinated by His70, Asp72, His85 and His98. In addition, hydrogen bond was bound by Leu82 and Ala83 to a sulfoneoxygen atom of the inhibitor [30].  Initially, to assure binding mode of ligand and protein, ligand docking with MMP-2 protein has been validated by Feng. All of the interactions between the ligand and catalytic zinc with the protein from our results are shown in Figure 3. Root mean square deviation (RSMD) was lower than 1. Both of the prior cases proposed high reliability of docking program. Therefore, the AutoDock 4.2 could be used to find the binding mode of other inhibitors of MMP-2.
All of the 30 compounds were docked into the binding site of protein by AutoDock 4.2 and were studied. The binding energy of all the compounds is reported in Table 5. The obtained energies were compared with the experimental IC 50 , and the 6a-6j compounds have the lowest binding energy. This means that based on the binding energy of the active site, these compounds, especially compound (6 h), are the best L-tyrosine scaffold based inhibitors. The binding orientation of compound (6 h) and hydrogen bond of the ligand with Leu82 and Ala83 are depicted in Figure 4. This compound was fitted into the active site of MMP-2. In all of the compounds, the oxygen of carbonyl and sulfonyl groups are important for hydrogen bond of the ligand with Leu82 and Ala83. R 2 and R 3 substituents have the main role for hydrogen bonding interactions. In 4a-4j compounds, IC 50 is higher than 5a-5j and 6a-6j compounds. When R 3 is OH and NHOH groups, hydrogen bond can be created better than when R 3 is -OCH 3 . Moreover, when R 2 is sulfonyl, because of two oxygen groups, a stronger hydrogen bond can be created than that of carbonyl. R 1 is sited in hydrophobic pocket. Methylene group can help the ligand to be fitted in lipophilic pocket, so two Methylene groups are better than one. Phenyl group can create a π-π interaction with Phe86. Therefore, phenethyl (C 6 H 5 CH 2 CH 2 ) is better than benzyl (C 6 H 5 CH 2 ).
To design L-tyrosine based inhibitors and prediction of their activity against MMP-2 based on AutoDock 4.2 and QSAR studies, addition to a higher value of molecular size, degree of branching, flexibility, shape and threedimensional coordination of the different atoms in a molecule, hydrophobicity and hydrophilicity of R 1 , R 2 and R 3 are highly important. When R 1 is more hydrophobic, and R 2 , is more hydrophilic, there is a stronger inhibition against MMP-2.

Conclusion
In this study, quantitative relationships between molecular structure and inhibitory effect of L-tyrosine scaffold based Figure 4 The best orientation of 6 h ligand. MMP-2 inhibitors were investigated by MLR and GA-PLS. In MLR, A combined data splitting-feature selection method (CDFS) was offered to develop a quantitative structure-activity relationship model. It was found that topological parameter (IC1) has a main effect on the inhibitory activity of the compounds, among the different QSAR models. By this simple procedure, a multilinear 5parametric model was created out of 22 descriptors. This method yielded acceptable results for the prediction set which was measured by cross-validation and Yrandomization. The findings indicate that the linear model produced by CDFS methodology could reproduce more than 92% of variances in the inhibitory activity. In addition to MLR, genetic algorithms, which demonstrated much better outcomes in comparison with stepwise regression, were used. By GA-PLS, the model with higher crossvalidation statistics was created and the results indicated that IC1, Mor24m, Mor15e and Mor32e are main descriptors. It can be concluded from the two methods that higher values of molecular size, degree of branching, flexibility, shape and three-dimensional coordination of the different atoms in a molecule are particularly important The docking study revealed that two hydrogen bonds between all of the inhibitors and the active site of MMP-2 (Leu82 and Ala83) are formed, as well as a π-π interaction between phenyl group and Phe86.
The information above could be used to design new inhibitors, and to show higher inhibitory activities and chemical synthesis of new inhibitors.