Background

Aphis craccivora, known as cowpea aphid, peanut (groundnut) aphid, or black legume aphid, is one of the most dangerous agricultural pests which directly causes harm to plants by delaying and deforming the growth of the plants through malnutrition. The molasses manufactured by the vector are placed on plants and stimulate the growth of molds with soot that limits photosynthesis. Aphis craccivora has many plant hosts, though it seemingly forechoice to groups of bean family. Other plants it hosts are families of Solanaceae, Rosaceae, Malvaceae, Chenopodiaceae, Caryophyllaceae, Ranunculaceae, Cucurbitaceae, Brassicaceae, and Asteraceae. Aphid is a vector of a series of plant viruses that include peanut rosette virus, groundnut mottle virus, mosaic virus common bean, alfalfa mosaic virus, and cucumber mosaic virus (Wikipedia, 2018).

Furan- and amide-containing compounds are among the molecular structures found to have an extensively wide range of applications in the field of medicine and agrochemical due to their extensive range of biological activity like antimicrobial/anti-inflammatory activity (Huczyński et al. 2012; Ravindra et al. 2006; Özden et al. 2005), as antibiotic activity (Arjona et al. 1999), as pesticides such antifungal (Yao et al. 2017), and insecticides (Teixeira et al. 2015; Wang et al. 2013) among others. The furan-2(5H)-one was examined to be a potential inhibitor of nicotinic acetylcholine receptor (Tian et al., 2019), and this was the reason for the docking study on the crystal structure of this protein.

The quantitative analysis of the structure-activity relationship (QSAR) is among the most efficient ways to optimize the main compounds and design new compounds. QSAR can be used to predict bioactivities, like toxicity, carcinogenicity, and mutagenicity, depending on the structural characteristics of the molecules and the actual mathematical models. Nowadays, one can easily and accurately calculate quantum chemical parameters of the compounds due to fast development in computer technology as well as theoretical quantum chemical study which helps in predicting the new compounds with better activity than the existing ones. This quantum chemical calculation is extensively applied while forming the QSAR models (Gagic et al. 2016). Molecular docking helps to investigate the capacity of the prepared compounds toward the interaction with the protein residue of the target organism and to also predict the preferred orientation of the molecules.

The objective of the research is to discover a new model that predicts the activity of chemical products with better activity capable of destroying Aphis craccivora using Genetic Function Approximation (GFA) or molecular docking techniques.

Material and methods

Dataset

In this work, we used a dataset of 20 compounds to design a relation between the chemical traces of compounds and their insecticidal activity. These 20 compounds of novel4-(N,N-diarylmethylamines) furan-2(5H)-one derivatives were obtained from the literature (Tian et al., 2019). The logarithm of the measured LC50 (μg mL−1) against insecticidal activity given by p LC50 (p LC50 = −log 1/LC50) was taken as a dependent parameter; therefore, the data was linearly correlated with the independent parameter/descriptors (Edache et al. 2017).

Optimization/molecular descriptor calculation

The database (see Fig. 1 and Table 1) was optimized at a density function theory level using the “Becke’s three-parameter read-Yang-Parr hybrid” (B3LYP) function together with “6-31G**” basis set of Spartan14 software (Arthur et al. 2016a). Graphical-user-interface of Spartan14 was utilized in drawing the 2D molecular structures of the dataset which were later exported in the form of 3D. The optimized structures were then taken to PaDel descriptor software to calculate the quantum molecular descriptors (Yap 2011).

Fig. 1
figure 1

The parent compound of the dataset

Table 1 Dataset compounds

Data division

To get a validated model, the dataset was split into (3:1) train test sets. Accordingly, the split was done in such a way that the compounds forming the train (70% of the data) and the test sets (30% of the data) are shared within an entire descriptive space filled by the complete dataset as described by Kennard-Stone Algorithm method (Arthur et al. 2016b).

The generated molecular descriptors were taken for regression analysis, with experimental activities as dependent parameters where the molecular descriptors served as independent parameters. With the Genetic Function Approximation method (GFA) incorporated in “Material Studio 2017” software, the compounds of train sets were utilized to develop the QSAR model. Four QSAR models were built where the best model was chosen according to the one with the lowest score of lack of fit (LOF) given as follows:

$$ \mathrm{LOF}=\mathrm{SSE}{\left(1-\frac{c+ dp}{M}\right)}^2 $$
(i)

where SSE represents the sum of squares of errors, d is a smoothing parameter defined by the user, c = number of terms a model possessed in addition to the constant term, M is equal to the number of samples present in the training set, and p = overall number of descriptors present in all terms of the model excluding the constant term (Edache et al. 2015).

Internal validation

The generated model was validated internally by the following parameters:

(a) The correlation coefficient (R2): explain the division of overall variation ascribed to the built model. The accepted value of R2 ranges from 0.5 to <1 and more the value of R2 and the model considered to be a better model as R2 approaches 1.0, though there are other analyses that the model passed to be a better one. Being the most common internal validation pointer, R2 is expressed as follows:

$$ R2=1-\frac{\sum {\left(Y\mathrm{expt}-Y\mathrm{perdt}\right)}^2}{\sum {\left(Y\mathrm{expt}-\overline{Y}\mathrm{train}\right)}^2} $$
(ii)

where Yexpt, Ypredt, and \( \overline{Y} \)train represent the experimental, predictive, and average activities of the training set (Adeniji et al. 2018).

(b) Adjusted R2: The value of R2 is inconsistent to evaluate the power of the built model. Thus, R2 is adjusted to restore and stabilize the model. This adjusted R2 is defined in equation iii as:

$$ R2\mathrm{adj}=\left(1-R2\right)\frac{\left(n-1\right)}{n-P-1}=\frac{\left(n-1\right)\left({R}^2-P\right)}{n-P+1} $$
(iii)

where p presents “the number of descriptors constituted the model,” while n = number of training set compounds (Ibrahim et al. 2018a).

(c) Cross-validated R2: The validity of the models was identified by a cross-validation test measured by predictive Q2cv. For a “leave one out (LOO) cross-validation,” a data point is eliminated (left-out) in the set and the model is readjusted; the predicted value of the eliminated data point is compared to its real value. This is repeated until each data removed. We can then calculate the value of Q2 using the sum of the squares of these elimination residues as in the equation below:

$$ Q2\mathrm{cv}=1-\frac{\sum {\left(Y\mathrm{predt}-Y\mathrm{expt}\right)}^2}{\sum {\left(Y\mathrm{expt}-\overline{Y}\mathrm{train}\right)}^2} $$
(iv)

where Yexpt, Ypredt, and \( \overline{Y} \)train represent the experimental, predictive, and average activities of the training set (Adedirin et al. 2018).

External validation

The prediction ability of the model was examined by an external validation through the ability of the model to predict the activity values of the test set compounds as well as its application in the calculating the predicted value of R2pred according to the equation below:

$$ R2=1-\frac{\sum {\left(Y\mathrm{predt}-Y\mathrm{expt}\right)}^2}{\sum {\left(Y\mathrm{expt}-\overline{Y}\mathrm{train}\right)}^2} $$
(v)

where Ypredt and Yexpt are the test set’s experimental and predicted activities while Ytrain indicates the average activities of the training set (Edache et al., 2017).

Statistical analysis of the descriptors

Variance inflation factor (VIF)

VIF is defined as the measure of multicollinearity amongst the independent variables (i.e., descriptors). It quantifies the extent of correlation between one predictor and the other predictors in a model.

$$ \mathrm{VIF}=\frac{1}{\left(1-{R}^2\right)} $$
(vi)

where R2 gives multiple correlation coefficient between the variables within the model. If the VIF is equal to 1, it means there is no intercorrelation in each variable, and if it ranges from 1 to 5, then it is said to be suitable and acceptable. But if the VIF turns out to be greater than 10, this indicates the instability of the model and need to be reexamined (Pourbasheer et al. 2015; Karthikeyan et al. 2009).

Mean effect (ME)

The average effect (mean effect) correlates the effect or influence of given molecular descriptors to the activity of the compounds that made up the model. The sign of descriptors shows the direction of their deviation toward the activity of compounds. That is an increase or decrease in the value of the descriptors will improve the activity of the compounds. The mean effect is defined by the following:

$$ \mathrm{Mean}\ \mathrm{effect}=\frac{B_j{\sum}_i^n{D}_j}{\sum_j^m\left({B}_j{\sum}_i^n{D}_j\right)} $$
(vii)

where Bj and Dj are the j-descriptor coefficient in a model and the values of each descriptor in training set, while m and n stand for the number of molecular descriptors as well as the number of compounds in the training set. To evaluate the significance of the model, the ME of all the descriptors was calculated (Edache et al. 2015).

Applicability domain

To confirm the reliability of the model and to examine the outliers as well as the influential compounds, it is very important to evaluate its domain of applicability. It aimed at predicting the uncertainty of a compound depends on its similarities to the compounds used in building the model and also the distance between the train and test set of the compounds. This can be achieved by employing William’s plot which was plotted using standardized residuals versus the leverages. The leverages for a particular chemical compound are given as follows:

$$ {h}_{\mathrm{i}}={Z}_{\mathrm{i}}{\left({Z}^{\mathrm{T}}\ Z\right)}^{-1}\ {Z_{\mathrm{i}}}^{\mathrm{T}} $$
(viii)

where hi = leverage for a particular compound, Zi = matrix i of the training set. Z = nxk descriptor-matrix for a training set compounds. ZT = transpose of Z matrix. The warning leverage (h*) that is the boundary for usual values of Z outliers is given as follows:

$$ {h}^{\ast }=3\frac{\left(p+1\right)}{n} $$
(ix)

where n = number of compounds in the training set whereas p gives the number of descriptors present in the model (Ibrahim et al. 2018a).

Ligand and receptor preparation

From the RCSBPDB (www.rcsb.org), the PDB format of the receptor was successfully downloaded. This was then taken to the discovery studio for an appropriate preparation where all the residues associated with it such as a ligand, water molecules, and other traces associated with the receptor were removed. The ligands (the optimized compounds) which were in the SDF file were transformed into the PDB file format. Figures 2 and 3 showed the prepared receptor and ligand (Ibrahim et al. 2018b).

Fig. 2
figure 2

Prepared receptor

Fig. 3
figure 3

Prepared ligand

Results

QSAR model

Genetic function algorithm (GFA) was used to generate the three QSAR models which predicted the activity of the compounds. The first model was chosen as the optimal model due to its statistical significance was presented in equation (x) below:

$$ Y=0.262046205\ast \mathbf{nCl}-5.412377232\ast \mathbf{ATSc4}-5.657053582\ast \mathbf{weta}\mathbf{3}.\mathbf{polar}+1.799880140 $$
(x)

All the validation/statistical parameters that signified the stability, robustness, and the prediction capability of the model were presented in Table 2.

Table 2 Validation parameter of the model

The name, symbol, and class of the three selected descriptors that made up the model are presented in Table 3.

Table 3 Descriptors and their description

Since the selected model is internally valid, then an external validation is the next step. Tables 4 and 5 represent the external validation as well as the calculation of predicted R2 of the best-chosen model.

Table 4 External validation
Table 5 Continuation of external validation

The experimental, predictive, and residual activity for both training sets and test sets are shown in Table 6. This table is represented to show the robustness of the model considering the lower residual values.

Table 6 The experimental, predictive, and residual activity

Statistical analyses

Statistical analyses on the model’s descriptors are very necessary in order to know how related they are. For that, Pearson’s correlation, variance inflation factor, mean effect (which contains regression analysis), and applicability domain were carried out.

Pearson’s correlation (Tables 7 and 8)

Figures 4 and 5 are the plot of predicted activity against the experimental activity (pLC50) and plot of standardized residual against the experimental activity (pLC50).

Table 7 Pearson’s correlation
Table 8 Standard regression coefficients “bj”, the values of mean effect (ME) and confidence interval (p values)
Fig. 4
figure 4

A plot of predicted activity against experimental activity (pLC50)

Fig. 5
figure 5

A plot of standardized residual against experimental activity (pLC50)

Applicability domain

A graph of leverages of each compound of dataset versus their standardized residuals terms William’s plot was presented in the Fig. 6 below.

Fig. 6
figure 6

William’s plot

Docking studies

Molecular docking analysis was carried out between the ligands (compounds) and the receptor to evaluate the binding affinity at the ligand-receptor interface.

Result of the design

The structure of the three (3) compounds which were designed using an optimization method of structure-based design. The structure of the chosen scaffold (compound 13) was presented in Fig. 9 below.

Discussion

QSAR model

The QSAR examination was carried out to relate the structure-activity relationship of novel 4-(N,N-diarylmethylamines) furan-2(5H)-one derivatives as a potent inhibitor of Aphis craccivora. Three descriptors were utilized in constructing the QSAR model which predicted the activity of the compounds based on the genetic function algorithm (GFA). The first model was chosen as the optimal model due to its statistical significance. The best-chosen model constructed was presented in equation.

All the validation/statistical parameters that signified the stability, robustness, and the prediction capability of the model were presented in Table 2. From the table, the highly calculated R2 value (0.8715) for the predicted activity indicated the robustness of the model. The descriptors, definitions, and their classes are represented in Table 3 below. The fact that 2D and 3D descriptors are present in the model implies that the descriptors used in the model can determine a better insecticidal activity of the compounds. The individual capability and inducing power of the selected descriptors toward the activity of the compounds depend on their values, signs, and as well their mean effects. Tables 4 and 5 represent the external validation as well as the calculation of predicted R2 of the best-chosen model.

Statistical analysis

The experimental, predictive, and residual activity for both training sets and test sets are shown in Table 6. The residual value is the difference between the predicted and actual activity.

To evaluate the relationships between each descriptor used in the built model, Pearson’s correlation was carried out on the values of the model’s descriptors and the results were presented in Table 7. The results show that the descriptors are not significantly inter-correlated for the fact that none of their correlation coefficients are up to 0.5, and this indicates the robustness as well as the stability of the built model. The variance inflation factor (VIF) values for each of the three descriptors were not up to 2, which indicates that the descriptors and the model are stable and accepted.

Table 8 showed the standard regression coefficients “bj,” the values of mean effect (ME), and confidence interval (p values). These give vital information on the impact and contribution of the descriptors toward the built model. The individual capability and inducing power of the selected descriptors toward the activity of the compounds depend on their values, signs, and as well their mean effects. The p values of the three descriptors (at 95% c.l.) that made up the model are all < 0.05; this implies that there is a significant relationship among these descriptors (as contrary to the null hypothesis) and the inhibitory concentration of the compounds.

Figure 5 which presented a graph of observed activity versus standardized residual shows a random dispersion at the baseline where the standardized residual is zero. Therefore, no systematic error occurred in the built model.

A graph of leverages of each compound of dataset versus their standardized residual terms William’s plot was plotted to discover the outliers as well as the chemical influential values of the model. The domain of applicability was established within a box at ± 3.0 limit for the residuals and leverage threshold h* (where h* calculated to be 0.8). The result indicates that except three compounds from the test set, all the molecules in the dataset are within the box of the applicability domain of the model. This may be characterized by their clear differences in chemical structures by considering the rest of the compounds highlighted in the dataset.

Molecular docking study

Due to unavailability of the crystal structure of Aphis craccivora, the complex crystal structure of the acetylcholine (protein AChBP) from Lymnaea stagnalis and imidacloprid with PDB code 2ZJU was utilized for the docking analysis since it possess “high homology extracellular domain” of the A. craccivora protein (nAChR) which has been used for many docking studies involving A. craccivora. The docking studies were performed between 2ZJU and the ligands (compounds) of novel4-(N,N-diarylmethylamines) furan-2(5H)-one derivatives to investigate the binding energy of the compounds to the target site of the insect. The ligands show a good interaction with the active site of the Aphis craccivora that is to say they inhibit the activity of the insect. Some ligands show high binding energy that varies from − 7.9 to − 8.4 kcalmol−1 as presented in Table 9. However, compound 13 with the highest binding score (− 8.4 kcal/mol) possessed an interaction mode with H-bond of ARG137 and 2.60716 bond length and hydrophobic interaction of TYR89, TYR89, ASN90, VAL183, TRP53, TYR89, and TYR185. The interaction between the compound with highest binding energy and the binding pocket of the receptor is shown in Fig. 7 while Figs. 8 and 9 is the 2D hydrogen bond interaction of compound 13 with the receptor.

Table 9 Ligands, binding affinity, H-bond, and hydrophobic interaction between high binding score compounds and receptor
Fig. 7
figure 7

The interaction between the compound with the highest docking score and receptor

Fig. 8
figure 8

2D interaction of compound 13

Fig. 9
figure 9

2D structure of the template

Design

In our research, we utilized the method of structure-based design to design a new (novel) insecticidal compound with a better activity by taking the compound with the highest docking score which is compound 13 (with binding energy of − 8.4 kcal/mol) as our template compound and thus provides suitable insecticidal activity and appeared very inspiring as a noteful scaffold. Compound 13 was selected as a synthetically best structure in which some structural advancement was performed on it. The newly designed compounds (A) 4-(((2-chloro-4-(trichloromethyl)pyridine-1(2H)-yl)methyl)(2-chloro-4-(trifluoromethyl)benzyl)amino)furan-2(5H)-one, (B) 4-(((2-chloro-4-(trichloromethyl)pyridine-1(2H)-yl)methyl)(3-chloro-4-(trifluoromethyl)benzyl)amino)furan-2-(5H)-one, and (C) 4-(((2-chloro-4-(2,2,2-trichloroethyl)pyridin-1(2H)-yl)methyl)(4-(trifluoromethyl)benzyl)amino)furan-2(5H)-one with their binding energy of – 8.9, – 9.1, and – 9.0 kcal/mol (as shown in Figs. 10, 11, and 12) were discovered to be of high quality and have greater binding affinity compared to the one obtained from the literature.

Fig. 10
figure 10

2D structure of the designed compound (A)

Fig. 11
figure 11

2D structure of the designed compound (B)

Fig. 12
figure 12

2D structure of the designed compound (C)

Conclusion

This research involves a QSAR and molecular docking studies on 20 compounds of novel 4-(N,N-diarylmethylamines) furan-2(5H)-one derivatives against Aphis craccivora. Using DFT for molecule optimization, Genetic Function Approximation (GFA) was employed in generating the built model. Out of three models built, the first model was identified to be the optimal constituted with good statistical parameters such as R2 = 0.871489, R2adj = 0.83644, cross-validated R2 = 0.790821, and external R2 = 0.550768. A decrease in negative coefficient descriptors (like ATSc4 and Weta3.polar) and an increase in positive coefficients descriptors (like nCl) will improve the activities of the compounds against A. craccivora. According to the docking scores, most of the ligands (compounds) show good inhibitory activity against A. craccivora protein. However, ligands 13 showed a higher binding affinity of – 8.4 kcal/mol. This compound has a strong affinity with the macromolecular target point of the A. craccivora (2zju) producing H-bond and as well the hydrophobic interaction at the target point of amino acid residue. Molecular docking gave an insight into the structure-based design of the new compounds with better activity against A. craccivora in which three compounds A, B, and C were designed and discovered to be of high quality and have greater binding affinity compared to the one obtained from the literature.