1 Background

Epidermal growth factor receptor (EGFR) from the receptor tyrosine kinases (RTKs) family, is known to be one of the most useful therapeutic targets for the mitigation of cancer most especially NSCLC. It plays a vital role in the regulation of cancer cell survival, migration, growth, proliferation, and differentiation [25, 27].

The most common and deadly type of all cancers around the globe is lung cancer which account for 25% of the cancer deaths every year ([6, 19]. Among the types of lung cancer with about 1.5 million patients and less than 20% survival rate is NSCLC [20].

In order to mitigate the problem of NSCLC, several medications were developed for up to 3 different generations. The first generation was designed for the treatment of EGFRL858R mutations, examples were gefitinib and erlotinib [16, 23]. The second generation was designed to treat EGFRT790M mutations examples are afatinib, dacomitinib and neratinib. The second generation drugs share a common structural features of quinazoline pharmacophore and acrylamide structure [18]. While in the case of the third generation, they were developed to treat EGFRT790M/L790M double mutations example AZD9291 [7].

QSAR is a computational chemistry method which correlate experimental activities (response variable) and physicochemical properties (molecular descriptors) of a compound quantitatively [11]. Furthermore, QSAR technique of computer-aided drug design plays a crucial role in physical, analytical, pharmaceutical, medicinal, organic chemistry, biochemistry, toxicology, chemical engineering, environmental sciences, and nanotechnology [2]. Another computational chemistry technique used to explore the interaction between 3D structures of a ligand and a receptor is molecular docking and contributes in virtual screening of library of compounds in computer-aided drug design. Pharmacokinetics played vital role in drug research and development by predicting ADME and drug likeness properties of drugs in hit-to-lead and lead-optimization campaigns [9].

The main aim of this work is to generate a valid QSAR model with a very high predictive power on some EGFRWT inhibitors using QSAR technique, study the nature of interactions between the EGFRWT inhibitors and EGFR enzyme via docking, and also to predict pharmacokinetics properties of these EGFRWT inhibitors.

2 Method

2.1 Dataset collection

Thirty nine (39) sets of EGFRWT inhibitors with their corresponding inhibitory activities (IC50) in nanomolar were retrieved from the work of [17] and used in this research. The inhibitory activities (IC50) of these molecules were then converted to their corresponding negative logarithms (pIC50) using Eq. 1 [3].

$$ {\mathrm{pIC}}_{50}=-\log\;{\mathrm{IC}}_{50}\times {10}^{-9} $$
(1)

2.2 Structure generation and stable geometry calculations

The initial step in any QSAR modeling study after data collection is drawing of the structures of the studied molecules. For this reason, the structures of all the studied molecules were generated utilizing the ChemDraw software [10]. After structure generation of the studied molecules, constraint in the structures was reduced via energy minimizing before finding the most stable structures of the studied molecules on potential energy surface using the Spartan 14 software. DFT at B3LYP/6-311G* level of theory was used in finding the most stable structures of all the studied molecules on global minima on the potential energy surface (PES) [15].

2.3 1D, 2D, and 3D descriptors generation, data pre-treatment, and dataset splitting

For the generation of the independent variables (descriptors), the most stable structures obtained in section 2.2 above were saved in a file format (SDF) that has been recognized by the software used in generation of descriptors, PaDEL descriptor tool kit [26].

The dataset was pre-treated manually to eliminate redundant and constant descriptors. After pre-treating the data, the Data division software was further used in dividing the data into training set and test set utilizing the Kennard-Stone algorithm [14]. The model building/training set were used for the generation of the models, and the validation/test set were used for assessing the generated models [11].

2.4 Model development

The models were developed utilizing the genetic function approximation (GFA) method with the actual pIC50 as the response variable and the descriptors as independent variables. In the case of variable selection, GFA selects most highly correlated descriptors to develop so many models which is one of the distinct characteristic of GFA.

2.5 Validation of the selected model

The most widely used assessment terms for QSAR models are the following; square correlation coefficient of the training set (R2training), adjusted R2 (R2adj), cross-validation coefficient (Qcv2), and square correlation coefficient of the test set (R2test). The high value of these parameters appears to be necessary but not enough [21].

In-view of this, the inter-correlation between descriptors can be detected using their variation inflation factors (VIF), to see whether these descriptors are highly correlated with one another or not. If the computed VIF values is up to 1 it means there is no inter-correlation between the descriptors; if it falls between 1–5, the model can be accepted, and if it is higher than 10, the model cannot be accepted. It can be calculated using the equation below:

$$ \mathrm{VIF}=\frac{1}{1-{R}^2} $$
(2)

where R2 is the correlation coefficient of the selected model [5].

The evaluation of significance and contribution of each descriptor to the selected model is performed using the value of mean effect of each descriptor. The mean effect is defined by the equation below:

$$ {\mathrm{MF}}_{\mathrm{j}}=\frac{B_j{\sum}_{j=1}^{i=n}{d}_{ij}}{\sum_j^m{B}_j{\sum}_i^n dij} $$
(3)

where MFj is the mean effect of a descriptor j in a model, βj is the coefficient of the descriptor J in that model and dij is the value of the descriptor in the data matrix for each molecule in the model building set, m is the number of descriptor that appear in the model and n is the number of molecules in the model building set [4]

To assure the robustness of a QSAR model and that the model was not obtained by chance correlation Y-Scrambling test was perform. It is done by reshuffling the actual activities and keeping the descriptors unchanged to generate new QSAR models for several trials, the new built QSAR models were anticipated to give low Q2 and R2 value. The validation parameter for this test is cRp (cR2p > 0.5) [12].

2.6 Applicability domain

A QSAR model is considered valid and void, if it is subjected to the applicability domain (AD) and found that the model can make good prediction of new activities of the training and test molecules. As such, the model is subjected to AD to find out whether there are influential or outliers molecules in the studied ones [22]. One of the methods used in assessing the AD is leverage approach and is given as hi:

$$ {h}_i={x}_i{\left({X}^TX\right)}^{-K}{x}_i^T\kern1em \left(i=A,\dots, \kern0.5em Z\right) $$
(4)

where the training set matrix I is given by xi, n × k descriptor matrix of the training set is represented by X, and XT is the transpose matrix X used in generating the model. The threshold for the value of X is the warning threshold (h*) which is presented in the equation below:

$$ {h}^{\ast }=3\left(x+1\right)/q $$
(5)

where the number of chemicals of the model building set is given by q, and the number of the descriptors in the model under evaluation is represented by x.

2.7 Molecular docking

A Dell Latitude E6520 computer system, with the following specification: Intel ® Core™ i7 Dual CPU,M330 @2.75 GHz 2.75 GHz, 8 GB of RAM was utilized to explore the nature of interactions between the active site of EGFR enzyme and five most active EGFRWT inhibitors (ligands) with the help of the Pyrex virtual screening software, Chimera, PyMOL, and Discovery studio.

Before the docking analysis, ligands were prepared from the optimized structures in section 2.2 above and saved in pdb file format using Spartan’14 [1]. The 3D structure of EGFR enzyme was downloaded from the protein data bank (with pdb ID: 4zau). The enzyme was prepared with the help of Discovery Studio Visualizer for the docking analysis; in the course of the preparation, hydrogen was added. Water molecule, heteroatoms, and co-ligands were eliminated from the crystal structure saved in pdb file.

The docking of the ligands to the active site of EGFR enzyme was achieved with the help of the Pyrex software using Autodock vina [11]. After successful docking protocol, re-formation of the complexes (ligand-receptor) for further investigation was also achieved utilizing the Chimera software. Discovery studio visualizer and PyMOL were used to investigate the interactions of the complexes.

2.8 Pharmacokinetics

Pharmacokinetics studies of five (5) most active compounds among the data set was carried out using SwissADME a free web tool used in evaluating ADME and drug-likeness properties of small molecules [8]. The Lipinski’s rule of five is useful at pre-clinical stage of drug discovery which state that if any chemical violate more than 2 of these criteria (molecular weight ˂ 500, number of hydrogen bond donors ≤ 5, number of hydrogen bond acceptors ≤ 10, calculated Log p ≤ 5, and polar surface area (PSA) ˂ 140 Å2), the chemical is said to be impermeable or badly absorbed [13].

3 Results

3.1 QSAR modeling

The results of the QSAR modeling are presented in Tables 1, 2, 3, and 4 and Figs. 1, 2, and 3.

Table 1 The symbols, descriptions, and classes of descriptors for the selected model
Table 2 The pIC50, predicted pIC50, and the residual values for the studied molecules
Table 3 MF, VIF, and correlation between descriptors of the selected model
Fig. 1
figure 1

Scatter plot of the actual pIC50 against predicted pIC50 of both the test and training sets of the selected model

Fig. 2
figure 2

Scatter plot of actual pIC50 against the residuals of both the test and training sets of the selected model

Fig. 3
figure 3

Williams plot of the selected model

Model 1

$$ {\displaystyle \begin{array}{c}{\mathbf{pIC}}_{\mathbf{50}}=0.951235508\ast \mathbf{ATSC}\mathbf{1}\mathbf{p}+1.336909853\ast \mathbf{GATS}\mathbf{1}\mathbf{s}-0.811346737\ast \mathbf{GATS}\mathbf{8}\mathbf{s}-3.524533882\ast \mathbf{SpMin}\mathbf{8}\_\mathbf{Bhm}+0.727035851\ast \mathbf{RDF}\mathbf{65}\mathbf{e}+2.794753386\ast \mathbf{P}\mathbf{1}\mathbf{e}+7.074425161\\ {}{R^2}_{trng}=0.9459,{R^2}_{adj}=0.9311,{Q^2}_{cv}=0.8947,{N}_{trng}=29,{R^2}_{test}=0.7008,{N}_{test}=10, and\ LOF=0.1195\end{array}} $$

3.2 Molecular docking

The results of the molecular docking are presented in Table 5 and Figs. 4 and 5.

Fig. 4
figure 4

2D structures of a Complex 3 and b Complex 4 with bond distances using the Discovery Studio visualizer

Fig. 5
figure 5

3D structures of a Complex 3 and b Complex 4 using PyMOL

3.3 Pharmacokinetics studies

The results of the Pharmacokinetic studies are presented in Table 6 and Fig. 6.

Fig. 6
figure 6

The bioavailability radar of a molecule 3 and b molecule 4 with the highest docking score

4 Discussion

4.1 QSAR modeling

The studied model was selected and reported because it is statistically fit with the following assessment parameters as compared to other models built: R2 of 0.9459, R2adj of 0.9311, Qcv2 of 0.8947, R2test of 0.7008, and LOF of 0.1195. The selected model was found to have passed the minimum recommended values for the validation of a good QSAR models as reported by [24].

The details of the descriptors that appear in the selected model are presented in Table 1. The positive coefficient of ATSC1p, GATS1s, RDF65e, and P1e descriptors indicate the positive correlation of this descriptor to the inhibitory activities of EGFRWT inhibitors that is the more you have these types of descriptors the more the inhibitory activity of the EGFRWT inhibitors against EGFR enzyme. In-view of the other descriptors with negative coefficients (GATS8s and SpMin8_Bhm) signifies the negative correlation of the descriptors to the inhibitory activities of the EGFRWT inhibitors. The lesser the number of these descriptors in the structures of EGFRWT inhibitors the more the action of EGFRWT inhibitors against EGFR enzyme.

4.1.1 Interpretations of the descriptors in the best model

ATSC1p is a Centered Broto-Moreau autocorrelation - lag 1/weighted by polarizabilities. This is computed by changing properties of the atom (w) with it equidistant values (w’), which is gotten by subtracting the standard value w of the compound from each w’ value:

$$ ATSCkw=\sum \limits_{i=1}^{\mathrm{nAT}\hbox{-} 1}\sum \limits_{j=i+1}^{\mathrm{nAT}}\left|\left({w}_i-\overline{w}\right)\right|\cdot \left|\left({w}_j-\overline{w}\right)\right|\cdot \delta \left({d}_{ij};k\right)=\sum \limits_{i=1}^{\mathrm{nAT}}\sum \limits_{j=i+1}^{\mathrm{nAT}}{w}_i^{\hbox{'}}\cdot {w}_j^{\hbox{'}}\cdot \delta \left({d}_{ij};k\right) $$

It was shown that if properties are equidistant only, then all autocorrelation descriptors are considered to be orthogonal, thus given the fitness of the succeeding statistics. The H-filled molecular graph presents the compound, which gives the properties of the compound and the sequential number of the vertices. The pairs of atoms that enter the summation were found using topological distance matrix.

GATS1s and GATS8s are Geary autocorrelation - lag 1 and 8/weighted by I-state. This descriptor measures the quality of the link between atomic charges of two atoms 8 bonds apart.

SpMin8_Bhm is the smallest absolute eigenvalue of Burden modified matrix - n 8/weighted by relative mass. SpMin is the minimum eigenvalue, called leading eigenvalue or spectral radius. This kind of function was called by Ivanciuc matrix spectrum operators (Eriksson et al., 2003). This eigenvalue has been suggested as an index of molecular branching, the smallest values corresponding to chain graphs.

RDF65e is radial distribution function - 065/weighted by relative Sanderson electronegativities, and this descriptor is based on the way atoms are arranged in the regular representation of a compound and develops a radial distribution function code (RDF code) that presents certain features in common with the 3D-MORSE code. It also explains the steric hindrance of a molecule. The RDF descriptor also provides information about ring types, atom types, the bond lengths, and planar and nonplanar systems.

P1e is the 1st component shape directional WHIM index/weighted by relative Sanderson electronegativities.

The scatter plot of predicted activities of both the test and training sets against the Actual pIC50 is shown in Fig. 1. It can be seen from the plot that the values were plotted around the straight line which shows the significance of the selected model. Also scatter plot of Actual pIC50 against the residuals of both the training and test is also shown (Fig. 2). The irregular appearance of these residuals on either side of zero on the plot shows the nonexistence of methodological error in the selected model.

The difference between the actual and the predicted activities in Table 2 is termed residual. The low residual values noted in the table verified the reliability of the selected model.

The correlation matrix of the descriptors that appear in the reported model was carried out (Table 3) and the descriptors were found to be orthogonal meaning no correlation exists between them. This indicates that the physicochemical parameters (descriptors) used in developing the reported model were of good quality. The calculated VIF values for the descriptors in the model building set of selected model were obtained to be less than 5 (Table 3) indicating the fitness of the selected model, and the descriptors were independent of one another. The MF value (Table 4) shows the contribution of a descriptor in comparison to other descriptors in the best model. The signs show the different direction of the descriptors of either improvement or reduction in the values of these descriptors will enhance the inhibitory activities.

Table 4 Y-scrambling test

The result of Y-scrambling test presented in Table 4 shows that ten (10) random models were generated; the R2 and Q2 values for these ten random models were found to be low. This verified the robustness of the selected model and that the model was not obtained by chance correlation.

The Williams plot presented in Fig. 3 identified five (5) influential compounds from which were all in the validation set. It is essential to understand that these molecules with leverage value higher than the threshold h*(h* = 0.72) are not considered when designing new EGFRWT inhibitors. These molecules might be dissimilar/structurally different from the molecules used to generate the model and, thus may have different mechanism of action.

4.2 Molecular docking

Molecular docking on the EGFR enzyme and five most active 2, 3-dihydro-[1, 4] dioxino [2, 3-f] quinazoline derivatives (ligands) (EGFRWT inhibitors) were studied (Table 5). From Table 5, we can see that molecule 4 has the highest docking score of − 8.3 kcal/mol which might be as a result of hydrophobic interactions formed with LEU718, LEU792, LYS745, LEU788, VAL726, ALA743, and LEU844 amino acid residues in the active site of EGFR enzyme. Hydrogen attach to one of the nitrogen of the quinazoline moiety (molecule 4) formed hydrogen bond in the active site of EGFR enzyme with THR790 and GLN791 amino acid residues with bond lengths of 2.9464 Å and 2.3643 Å. The most active molecule (3) has a docking score of − 8.0 kcal/mol. The hydrogen attach to one of the nitrogen of the quinazoline moiety (ligand 3) formed hydrogen bond interaction with MET793 amino acid residue of bond length of 2.4928 Å. And also, it formed hydrophobic interaction with amino acid residues LYS745, VAL726MET766, LEU718, and ALA743 of the EGFR enzyme. The 2D and 3D structures of complex 3 and 4 are shown in Figs. 4 and 5.

Table 5 The binding energy, amino acid residues, hydrogen bond interactions, and bond length of some ligands.

4.3 Pharmacokinetics studies

The results of the pharmacokinetics studies of the most active compounds are shown in Table 6. From the table it can be see that none of the molecules violate any of the criteria stated; it means there is a high tendency that all of these molecules might be pharmacologically active. In a null shell, these molecules are said to have good absorption, low toxicity level, orally bioavailable, and permeable. The bioavailability radar gives an overview of the drug-likeness of a molecule (Fig. 6). The region painted pink indicates the range for each features.

Table 6 ADME and drug-likeness properties

5 Conclusion

A mode with a very high predictive power on 39 non-small cell lung cancer agents (NSCLC) (EGFRWT inhibitors) was developed using QSAR. The reported model was selected because it is statistically fit with the following assessment parameters as compared to other models built: R2 of 0.9459, R2adj of 0.9311, Qcv2 of 0.8947, R2test of 0.7008, and LOF of 0.1195. The reported model was further subjected to other assessments such as applicability domain, Y-scrambling, and VIF and found to be statistically significant. Molecular docking was used to explore the kind of interactions between five most active EGFRWT inhibitors and EGFR enzyme. Molecule 4 has the highest docking score of − 8.3 kcal/mol among co-ligands. This might be a result of hydrophobic interactions formed with LEU718, LEU792, LYS745, LEU788, VAL726, ALA743, and LEU844 amino acid residues in the active site of EGFR enzyme. Hydrogen attach to one of the nitrogen of the quinazoline moiety (molecule 4) formed hydrogen bond in the active site of EGFR enzyme with THR790 and GLN791 amino acid residues with bond lengths of 2.9464 Å and 2.3643 Å. The pharmacokinetics studies indicated that these molecules have good absorption, low toxicity level, and permeability properties. The results of this study give room for designing new potent EGFRWT inhibitors against their target enzyme.