Introduction

The matrix metalloproteinases (MMPs) function predominantly as enzymes that degrade structural components of the extracellular matrix (ECM) [1-4]. MMPs play a substantial role in tumor progression and invasion of inflammatory cells. Among MMPs, MMP-2 easily digests the denatured collagens and gelatins [5,6]. It is highly involved in the process of tumor invasion and has been considered as a promising target for cancer therapy [3,7,8]. MMP-2 has a catalytic center) zinc (II) ion (and two hydrophobic domains (S1´ pocket and S1 pocket). S1´ pocket, the key domain of MMP-2, is deeper and narrower than other MMP subtypes and S1 pocket is a solvent exposure domain [3,9,10].

The structural requirements of an MMP-2 inhibitor are: (1) a functional group that binds the zinc ion (zinc-binding group; ZBG), capable of chelating the active site zinc ion; (2) a functional group which interacts with the backbone of enzyme; (3) side chains that undergo effective interactions with the enzyme subsites, such as S1´ pocket and S1 pocket [3,11,12].

Cheng et al. studied the L–hydroxy proline scaffold-based MMP-2 inhibitors in 2008 [13], and, in order to identify more potent MMP-2 inhibitors, replaced L-hydroxy proline with the L-tyrosine scaffold to form a new integrated structural pattern. They reported that the alteration in substitution pattern at R1, R2 and R3 positions alter MMP-2 inhibitory activity [1].

In 2012, 30 L–hydroxy tyrosine scaffold-based MMP-2 inhibitors were identified. It seems that finding a relationship between the structure of these compounds and their inhibitory activities in order to design structures with better activities and to predict their activity would be essential.

Quantitative structure activity relationships (QSARs), one of the most extensively used methods in chemometrics, and molecular docking are two of the helpful methods for drug design and prediction of drug activity [14-16]. QSAR models are mathematical equations which generate a communication between chemical structures and their biological activities, while molecular docking is done to specify the structural features that are important for interaction with a receptor.

In this report, we have performed a QSAR study and a molecular docking examination on 30 compounds of L-tyrosine derivatives which had been synthesized and evaluated as metalloproteinase (MMP-2) inhibitors [1].

Materials and methods

QSAR

All calculations were implemented using an Intel Core i5 2.4 GHz processor, with the windows 7 operating system. Geometry optimization was done by Hyperchem 8.0 software. Descriptor generation was performed by Hyperchem 8.0, DRAGON package and Gaussian 98 W programs. SPSS software (version 11.5) and MATLAB software (version 7.12.0) were used for model creation and validation methods.

Activity data and descriptors generation

In this study, the biological data employed is MMP-2 inhibitory activity of 30 compounds. The synthesis and determination of biological activity of these inhibitors have been reported by Cheng et al. [1]. The structure of these compounds and their biological activity are shown in Table 1. The two-dimensional structures of molecules were drawn using Hyperchem 8.0 software. At the beginning, pre-optimization was conducted using the MM+ molecular mechanic force field and then a more accurate optimization was performed with the semi-empirical PM3 method. The optimization was performed using the Polak–Ribiere algorithm until the root mean square gradient reached 0.01 kcal/ (Å mol). Hyperchem 8.0 program was also used to calculate chemical descriptors including: surface area, molecular volume, hydration energy, octanol/water partition coefficient (logP), molar refractivity, molar polarisability and molar mass.

Table 1 Chemical structures of L-tyrosine derivatives and their experimental and predicted activity by MLR and GA-PLS models

The output files of Hyperchem 8.0 software (.hin files) were transferred to DRAGON package to calculate four classes of descriptors (0D, 1D, 2D and 3D) including 28 constitutional descriptors, 10 functional groups, 18 atom-centered fragments, 216 topological descriptors, 15 molecular walk counts, 64 BCUT descriptors, 24 Galvestopol.charge indices, 96 2D autocorrelations, 14 charge descriptors, 41 Randic molecular profiles, 27 geometrical descriptors, 150 radial distribution function descriptors (RDF), 160 3D-MoRSE descriptors, 99 WHIM descriptors and 196 GETAWAY descriptors.

The z-matrix files of compounds were also provided by Hyperchem 8.0 program (.zmt files) and then were transferred as input files to the Gaussian98W program [17]. Semi-empirical molecular orbital calculation by PM3 method was performed using Gaussian98W program. Different quantum chemical descriptors were obtained by this method including highest occupied molecular orbital energy (EHOMO), lowest unoccupied molecular orbital energy (ELUMO), and molecular dipole moment. Local charge was obtained by PM3 method in Gaussian98W. Hardness (η = 0.5 (EHOMO + ELUMO)), softness (S = 1/η), electronegativity (χ = 0.5 (EHOMO - ELUMO)) and electrophilicity (ω = χ2/2η) were calculated according to the method proposed by Thanikaivelan et al. [18].

Data processing and models building

All calculated descriptors were collected in a data matrix, D, the number on rows was representative of molecule numbers and the numbers on columns accounted for descriptors (30 × 1168). At the beginning, the columns which had constant and near constant values were removed from the original data matrix. Since collinear variables disrupt the models in MLR analysis, collinear descriptors needed to be detected and removed. The correlation of descriptors with one another and with activity data was then investigated. In the pairs with collinearity higher than 0.9, one which had the highest correlation versus the activity was retained and the rest were omitted. The number of total descriptors for each molecule reached 291.

The data set (30 compounds) was split into a calibration set and validation set. Validation subset was made of 20% of the total data (here, 6 biological activity data). An MLR analysis was performed by the stepwise regression SPSS (version 12.0) software and the model was built.

Since the data splitting has a considerable influence on the final selected model, a combined data splitting-feature selection (CDFS) strategy was employed [19]. In the CDFS methodology, several subdivisions of calibration and validation set were made (10 times). In each case, the best model was chosen with a correlation coefficient higher than 0.95. The created models were validated by leave-one out (LOO) cross-validation method and Y-randomization test to investigate their predictability.

Molecular docking

Molecular docking has become an increasingly main tool for drug discovery. Docking is a method which predicts the preferred orientation of one ligand when bound in an active site to form a stable complex. In this study, molecular docking of L-tyrosine derivatives with MMP-2 was studied by AutoDock 4.2 program [20] to find their binding site and the best direction based on the binding energy. Pdb file of MMP-2 was obtained from www.pdb.org (PDB: 3AYU) [21]. All water molecules were deleted, polar hydrogens were added and the Kollman charge [22] was computed. A 120 x 120 x 120 Å size was chosen for the grid box, which covers the whole protein. In all of the ligands, the same as protein, polar hydrogens were added and the Gasteiger charge was computed [23]. After preparation of ligand and protein files, the map files were created. Docking process with 50 runs and maximum number of evaluations 2500000 were performed. The final .dlg files were analyzed and the interaction between ligands and the active site of protein were studied. The ligand-protein interactions were analyzed and visualized by Discovery studio Ver. 3.

Result and discussion

Multiple linear regression analysis (MLR)

This study made use of an MLR analysis, as a simple regression method. The stepwise regression (using SPSS software) was also utilized to choose the most relevant set of descriptors for each type of the split data. The model coefficients were calculated using calibration data and then were used to predict the biological activity of validation samples. Several models were constructed by running a typical stepwise regression which is ranked based on calibration correlation coefficient (R2c). A model with calibration correlation coefficient higher than 0.9 was selected. The created model needed to be validated. The theoretical validation is generally classified into two groups: internal and external validation. Two of the internal validation methods include Leave-one out cross validation (LOO) and Y-randomization [24]. The offered model was evaluated for both over-fitting and avoidance of chance correlations by the leave-one-out cross-validation (LOO) method and Y-randomization test respectively.

MLR method was repeated several times by different splitting data. In each case, one model was proposed. Leave-one-out cross-validation was performed and lastly, the best ten models were obtained with R2 and Q2 LOO higher than 0.9 in as reported in Table 2. The statistical qualities of models were acceptable and all of them had Q2 LOO larger than 0.92; hence, the predicted models can make over 92% of variances in the inhibitory activity. In addition, results were acceptable in the prediction sets. The values of prediction correlation coefficient (R2 p) are listed in Table 2 for the ten final models. All of the squared correlation coefficients were higher than 0.90, so the resultant linear models can predict 90% of variances in the inhibitory activity in the ten prediction sets.

Table 2 The best ten models were selected for future analysis in MLR

The total number of descriptors, which existed in all the ten models, was 22. These descriptors are briefly described in Table 3. Among these, 3 descriptors were common. Some of the descriptors such as E3u, HATS7e, RDF065m, SP20, RDF115e, G3m and dipx were observed only in one model. The repeated descriptors were IC1, Mor24m and Mor15e. IC1 descriptor existed in all the ten models. IC1 is topological descriptor and Mor24m and Mor15e are 3D-MoRSE ones.

Table 3 Brief description of the descriptors in ten models

The repetition of IC1 in all the ten models indicated that this descriptor has a main effect on L-tyrosine scaffold-based MMP-2 inhibitors. 3D-MoRSE descriptors also have significant effects. The quantum descriptors such as (dipx) that were seen only in one model have lower effects on MMP-2 inhibitors. The observation of the models, as listed in Table 2, revealed that there is a high degree of resemblance between the subset of descriptors (without considering the difference between the chosen descriptors in different models).

To create a general model, all of the 22 descriptors were used and an MLR analysis was applied with the stepwise variable selections. Finally, one model with balance between the highest R2, Q2 LOO and the lowest number of descriptors was opted for further analysis, as reported in MLR Equation 1:

$$ \mathrm{pI}{\mathrm{C}}_{50} = \hbox{-} 13.428\ \left( \pm 1.720\right) + 5.965\ \left( \pm 0.488\right)\ \mathrm{I}\mathrm{C}1 + 1.464\ \left( \pm 0.618\right)\ \mathrm{Mor}24\mathrm{m}\ \hbox{-}\ 0.187\ \left(\pm 0.044\right) $$
$$ \mathrm{R}\mathrm{D}\mathrm{F}115\mathrm{m} + 0.307\ \left(\pm 0.067\right)\ \mathrm{S}\mathrm{P}20 + 1.284\ \left(\pm 0.514\right)\ \mathrm{Mor}15\mathrm{e} $$
(1)
$$ \mathrm{N} = 24,\ {{\mathrm{R}}^2}_{\mathrm{c}} = 0.958,\ \mathrm{S}.\mathrm{E} = 0.285,\ {{\mathrm{Q}}^2}_{\mathrm{LOO}} = 0.921,\ \mathrm{R}\mathrm{M}{\mathrm{S}}_{\mathrm{CV}} = 0.346 $$

In this equation, N represents the number of molecules in the calibration set. R2 c and Q2 LOO are respectively the squared correlation coefficients for calibration and cross-validation. In addition, S.E is the standard error of calibration and RMSCV is the root mean square error of cross-validation. This equation has three common descriptors (IC1, Mor24m and Mor15e), with high calibration statistics and prediction ability.

Topological descriptors, such as IC1, explain the size, degree of branching, flexibility and overall shape. 3D-MoRSE descriptors explain three-dimensional coordination of the different atoms in a molecule. IC1, Mor24m and Mor15e have positive signs which indicate that pIC50 is directly related to these descriptors. The radial distribution function (RDF) descriptors are based on the distance distribution in the molecule. RDF115m has a negative sign which indicates that pIC50 is inversely related to this descriptor.

For the MMP-2 inhibition activity, a higher value of molecular size, degree of branching, flexibility, shape, three-dimensional coordination of the different atoms in a molecule (IC1) and a lower value of radial distribution function, RDF115m, are beneficial to the activity.

The general model has a Q2 LOO equal to 0.921; hence, the predicted model can make over 92% of variances in the inhibitory activity. The predicted values of pIC50 were obtained for all the molecules by MLR equation which was listed in Table 1 and were plotted against the experimental values (Figure 1A).

Figure 1
figure 1

Plot of predicted pIC50 versus the experimental values for MLR model (A), and GA-PLS model (B).

The Y-randomization test was performed to evaluate the robustness of the general model. In the Y-randomization test, the activity data were randomly permuted in the original model and 10 models were generated. All of the models were expected to have lower R2 and Q2 LOO values than the original MLR model. If the reverse occurs, a suitable MLR model cannot be generated. The lower R2 and Q2 LOO values are shown in Table 4.

Table 4 R 2 and Q 2 LOO values after ten Y-randomization tests

Genetic algorithm partial least squares (GA-PLS)

Since the variable selection method, that is the stepwise regression, may not be a suitable selection, variable selection methods such as the genetic algorithm, which demonstrate much better outcomes in comparison with the stepwise regression, are more favorable [25,26].

In genetic algorithm, if its corresponding descriptor is contained, a gene receives a value of 1 in the subset; otherwise, it takes a value of zero. The number of genes at each chromosome is equivalent to the number of descriptors. The population size varied between 80 and 125 for different GA runs. The number of genes with the values of 1 was relatively lower than the number of genes with the values of 0. The genes with the values of 1 were maintained. The chromosomes with the smallest numbers of chosen descriptors (total number of descriptors for each molecule reached 105) and the highest fitness were selected as the intended model. The predicted model was tested by leave-n-out cross-validation [27]. A leave-one-out cross-validation was triggered and the value of Q2 LOO was obtained 0.850.

In GA-PLS, the resulted model with higher cross-validation statistics is reported in Equation 2 and the predicted values of pIC50 are shown in Table 1 and plotted against the experimental values in Figure 1B.

$$ \mathrm{pI}{\mathrm{C}}_{50} = \hbox{-} 12.589\ \left( \pm 1.208\right) + 6.363\ \left( \pm 0.394\right)\ \mathrm{I}\mathrm{C}1 + 2.119\ \left( \pm 0.485\right)\ \mathrm{Mor}24\mathrm{m}\ \hbox{-}\ 0.665\ \left( \pm 0.189\right) $$
$$ \mathrm{Mor}15\mathrm{e}\ \hbox{-}\ 0.784\ \left( \pm 0.370\right)\ \mathrm{Mor}32\mathrm{e} $$
(2)
$$ \mathrm{N} = 24\kern1.5em {{\mathrm{R}}^2}_{\mathrm{c}} = 0.932\kern1em \mathrm{S}.\mathrm{E} = 0.328\kern1em {{\mathrm{Q}}^2}_{\mathrm{LOO}} = 0.900\kern3.25em \mathrm{RMSCV} = 0.391 $$

This equation has three MLR descriptors (IC1, Mor24m and Mor15e) with high calibration statistics and prediction ability. IC1 and Mor24m have a display positive effect and Mor15e and Mor32e have a display negative effect on inhibitory activity in GA-PLS. Unlike MLR, The radial distribution function (RDF) descriptors have no effect on inhibitory activity in GA-PLS.

As observed from Equation 2, the variables used in this equation can explain 93.2% of the variances in the inhibitory activity of the MMP-2 inhibitors used in this study.

The Y-randomization test was performed to evaluate the robustness of the created model in GA-PLS and 10 models were generated. All of the models were expected to have lower R2 and Q2 LOO values than the original GA-PLS model, as shown in Table 4.

Applicability domain of the model

A QSAR model is exploited to screen new compounds when its domain of application has been defined [28]. The prediction may be assumed reliable for only those compounds which fall into this domain [29]. Standardized residuals of the activity were computed and were plotted versus leverage values (h). The value of leverage was calculated for every compound. Values are always between 0 and 1. A value of 0 is indicative of perfect prediction and usually is not accessible, and a value of 1 indicates very poor prediction. The lower the value, the higher confidence in the prediction. Warning leverage (h*) is another standard for explanation of the results and is, generally, fixed at 3 (k + 1)/ n, where k is the number of model parameters and n is the number of calibration set [29]. Calculated leverage for calibration set is useful for determining the compounds which affect the model and, in terms of validation set, useful for assigning the applicability domain of the model. The William’s plot for the developed models in MLR and GA-PLS are shown in Figure 2.

Figure 2
figure 2

William’s plot of generated MLR model (A), and GA-PLS model (B).

Response outliers are compounds that have standard residual points higher than ± 2.0 standard deviation units and a leverage value higher than the warning leverage, which is 0.75 and 0.5 for MLR and GA-PLS respectively. As can be seen in Figure 2, all studied molecules in calibration and validation set lie with high degree of confidence in application domain of the developed models in both methods.

Molecular docking studies

To explore the ligand binding modes, and to find amino acids, which are essential in ligand binding to MMP-2, molecular docking was performed on a ligand-binding pocket. The way the compound was bound with the lowest free energy was studied. Interactions between MMP-2 and ligand N-4-[(1-hydroxycarbamoyl-2-methyl-propyl)-(2-morpholin-4-yl-ethyl)-sulfamoyl]-4- pentyl-benzamide (SC-74020) were obtained by Feng et al. and were reported. According to their results, the catalytic zinc is chelated by His120, His124, His130 and ligand, and the structural zinc is coordinated by His70, Asp72, His85 and His98. In addition, hydrogen bond was bound by Leu82 and Ala83 to a sulfoneoxygen atom of the inhibitor [30]. Initially, to assure binding mode of ligand and protein, ligand docking with MMP-2 protein has been validated by Feng. All of the interactions between the ligand and catalytic zinc with the protein from our results are shown in Figure 3. Root mean square deviation (RSMD) was lower than 1. Both of the prior cases proposed high reliability of docking program. Therefore, the AutoDock 4.2 could be used to find the binding mode of other inhibitors of MMP-2.

Figure 3
figure 3

Interactions between ligand and MMP-2.

All of the 30 compounds were docked into the binding site of protein by AutoDock 4.2 and were studied. The binding energy of all the compounds is reported in Table 5. The obtained energies were compared with the experimental IC50, and the 6a-6j compounds have the lowest binding energy. This means that based on the binding energy of the active site, these compounds, especially compound (6 h), are the best L-tyrosine scaffold based inhibitors. The binding orientation of compound (6 h) and hydrogen bond of the ligand with Leu82 and Ala83 are depicted in Figure 4. This compound was fitted into the active site of MMP-2. In all of the compounds, the oxygen of carbonyl and sulfonyl groups are important for hydrogen bond of the ligand with Leu82 and Ala83. R2 and R3 substituents have the main role for hydrogen bonding interactions. In 4a-4j compounds, IC50 is higher than 5a-5j and 6a-6j compounds. When R3 is OH and NHOH groups, hydrogen bond can be created better than when R3 is –OCH3. Moreover, when R2 is sulfonyl, because of two oxygen groups, a stronger hydrogen bond can be created than that of carbonyl.

Table 5 The obtained binding energy by AutoDock
Figure 4
figure 4

The best orientation of 6 h ligand.

R1 is sited in hydrophobic pocket. Methylene group can help the ligand to be fitted in lipophilic pocket, so two Methylene groups are better than one. Phenyl group can create a π- π interaction with Phe86. Therefore, phenethyl (C6H5CH2CH2) is better than benzyl (C6H5CH2).

To design L-tyrosine based inhibitors and prediction of their activity against MMP-2 based on AutoDock 4.2 and QSAR studies, addition to a higher value of molecular size, degree of branching, flexibility, shape and three-dimensional coordination of the different atoms in a molecule, hydrophobicity and hydrophilicity of R1, R2 and R3 are highly important. When R1 is more hydrophobic, and R2, is more hydrophilic, there is a stronger inhibition against MMP-2.

Conclusion

In this study, quantitative relationships between molecular structure and inhibitory effect of L-tyrosine scaffold based MMP-2 inhibitors were investigated by MLR and GA-PLS. In MLR, A combined data splitting-feature selection method (CDFS) was offered to develop a quantitative structure–activity relationship model. It was found that topological parameter (IC1) has a main effect on the inhibitory activity of the compounds, among the different QSAR models. By this simple procedure, a multilinear 5-parametric model was created out of 22 descriptors. This method yielded acceptable results for the prediction set which was measured by cross-validation and Y-randomization. The findings indicate that the linear model produced by CDFS methodology could reproduce more than 92% of variances in the inhibitory activity. In addition to MLR, genetic algorithms, which demonstrated much better outcomes in comparison with stepwise regression, were used. By GA-PLS, the model with higher cross-validation statistics was created and the results indicated that IC1, Mor24m, Mor15e and Mor32e are main descriptors. It can be concluded from the two methods that higher values of molecular size, degree of branching, flexibility, shape and three-dimensional coordination of the different atoms in a molecule are particularly important

The docking study revealed that two hydrogen bonds between all of the inhibitors and the active site of MMP-2 (Leu82 and Ala83) are formed, as well as a π-π interaction between phenyl group and Phe86.

The information above could be used to design new inhibitors, and to show higher inhibitory activities and chemical synthesis of new inhibitors.