Background

After years of efforts to fight and control of malaria, it is still a prevalent and deadly infectious disease, especially in the third-world countries in Africa, Asia, and South America [1, 2]. The estimated deaths because of malaria in 2015 were 429,000 (range 235,000–639,000), which were mainly distributed in the Africa (92%), Southwest Asia (6%) and the Eastern Mediterranean (2%) [3]. The pregnant women and children below 5 years of age are the more vulnerable groups, and about 85% of deaths occurring in children with this age range [4].

The disease is caused by a parasite of the genus Plasmodium. The main species of Plasmodium are Plasmodium falciparum, Plasmodium vivax, Plasmodium ovale, Plasmodium knowlesi and Plasmodium malariae, with P. falciparum responsible for most of the mortality [1, 4].

Many compounds with anti-malarial activity have been described, including quinine, chloroquine, proguanil, pyrimethamine, artemisinin, mefloquine, atovaquone [5]. The major problem in the treatment of malaria is that Plasmodium parasites become resistant to anti-malarial drugs. The most commonly used anti-malarial drug, chloroquine, became ineffective due to rapidly spreading resistance of P. falciparum to this compound; the newer anti-malarial drugs, such as mefloquine or artemisinins also face to resistance problem. The other problem in control of malaria is the lack of an effective vaccine for this disease. Therefore, developing new anti-malarial agents is a necessity and chemical modification of existing compounds is one of the strategies available [1].

In silico methods, such as quantitative structure activity relationship (QSAR), molecular docking and pharmacophore modelling by decreasing the time and cost of drug discovery play a significant role in the field of drug design and development [6]. QSAR can provide a mathematical relationship of the physicochemical properties and structural features that is required for a specific activity for a set of similar compounds. In this way, synthesis of potential candidate molecules can be performed by focusing on the chemical characteristics which have influenced on a specific activity [7].

QSAR methods have previously been used to investigate anti-malarial compounds. In 2001, Agrawal et al. studied the anti-malarial activity of a series of sulfonamide derivatives (2,4-diamino-6-quinazoline sulfonamides) [8]. In 2002, 3D-QSAR studies on the artemisinin analogues were performed by Cheng et al., a study done on the basis of the docking models employing comparative molecular force fields analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA) [9]. Katritzky et al. investigated two various set of compounds for each of two strains D6 and NF54 of Plasmodium falciparum using QSAR modelling with CODESSA PRO software in 2006 [10]. A study on anti-malarial artemisinin derivatives was done by Cardoso et al. in 2008 using molecular electrostatic potential (MEP) maps and multivariate QSAR [11]. In 2015, Ojha and Roy reported the status of anti-malarial drug research from the year 2011 to 2014 with special reference to application of QSAR models. In their report, aminoquinolines as a group of anti-malarial compounds were analysed by various research groups using QSAR models; the other groups of compounds were endochin analogs, artemisinin analogs, aurone chalcone, prodiginines, acridine, hydroxypyridinones and cycloguanil derivatives, which their QSAR modelling reported [7]. In 2018, Cheoymang and Na-Bangchang in a systematic review article reported about application of in silico models for anti-malarial drug discovery in the years between 2008 and 2015. In this article 2D- or 3D-QSAR is mentioned as one of the commonly applied in silico methods for investigating on anti-malarial compounds [12].

Imidazolopiperazine is a class of anti-malarial compounds, including KAF156 (also known as GNF156) which is active against a wide range of Plasmodium species and in phase 2 trials have shown better or analogous parasite killing rates compared to the effective artemisinin-based combination therapy (ACT) [13, 14]. In this article, the anti-malarial activity of a set of imidazolopiperazines was investigated using quantitative structure activity relationship. Artificial neural networks were used for modelling the activity of 33 imidazolopiperazines derivatives.

Methods

Data set

A data set consisting of imidazolopiperazines reported by Wu et al. and Nagle et al. [15, 16] was used for this study. A set of 33 compounds was selected which their structural skeleton and the name of compounds are displayed in Table 1.

Table 1 Structures of imidazolopiperazine derivatives and their biological activities (IC50, nM) for 3D7 and W2

Descriptor generation and pretreatment

After drawing the 2D structures of the 33 imidazolopiperazine derivatives using HyperChem software (Ver. 8.0.3, Hypercube Inc., Gainesville, USA), the geometries of the molecules were fully optimized using the semi-empirical AM1 method. The optimization was done until the root mean square gradient achieves 0.001 kcal mol−1 or 1000 cycles for all the molecular structures.

The resulting optimized geometries were transferred to the DRAGON software [17, 18], and the descriptors were calculated. Then, the same descriptors for all the structures were kept and others were removed. At the first step for pretreatment of the descriptors, the constant or near-constant variables among the remained descriptors were removed. At the second step, for decreasing the redundancy existing in the descriptors, the correlation of descriptors with each other and with the biological activities (pIC50) against 3D7 and W2 was examined and among the collinear ones (r > 0. 95), the descriptors that had the highest correlation with pIC50 for 3D7 and W2 were retained. After these steps, the number of remaining descriptors for all the 33 compounds in each mode (against 3D7 and W2) was about 555 which were collected in an n × m data matrix (D), where n and m are the number of imidazolopiperazine derivatives (= 33) and the number of descriptors (= 555), respectively. The data set was randomly divided to training set with 23 compounds, test and validation set, each of them include 5 compounds.

It should be noted that the variable selection was done by stepwise multiple linear regression (SMLR) on the training set using SPSS (version 19.0, SPSS Inc., http://www.spss.com). Artificial neural networks were done using MATLAB (version 7.6, Math work, Inc., http://www.mathworks.com). All other statistical calculations and evaluations were also conducted in MATLAB. In ANN modeling, a two-layer feed-forward network with sigmoid hidden neurons and linear output neurons was used with only two hidden neurons. The mean square error was also used as the performance criteria of the network.

Results

In the first step, due to the preference of using the linear models to the non-linear ones [19], the QSAR modelling of the mentioned imidazolopiperazine derivatives with anti-malarial activities was investigated using the linear models. This effort did not have good results, by using multiple linear regression (MLR) and partial least square (PLS) models, and forced the authors to test the nonlinear models.

At the next step for evaluation the randomized distribution of the molecules belong to the three data set (the training, validation and test sets) in the space of descriptors, principal component analysis (PCA) was applied. The two-dimensional (2D) PCA plot (PC1 vs. PC2) of imidazolopiperazine derivative molecules for the data of two models (3D7 and W2) is displayed in Fig. 1.

Fig. 1
figure 1

Random distribution of the training, validation, and test sets at two-dimensional PCA plot (PC1 vs. PC2) related to a 3D7 and b W2

Variable selection

Variable selection was done using SMLR on the 23 × 555 data matrix. The statistic parameters like Fisherʼs F value (F) and correlation coefficient (R2) were employed for evaluation the goodness of the selected variables and as fitting criteria. In this way, variables with the most significant values of F and highest correlation coefficient were selected by inserting into/removing from the model respectively and 12 variables were selected by using this approach. In the next step, the models with 1 to 12 variables were checked using ANN method for training and validation sets [20] and it was found that in the models with more than 6 variables despite of improvement in the training set results, the prediction ability of the validation set reduced because of overfitting [21]. The results of SMLR for the selected variables are summarized in Additional file 1: Tables S1–S3 for the 3D7 model and in Additional file 1: Table S4–S6 for the W2 model.

So the number of 6 variables was selected for both 3D7 and W2 models. The 6 selected descriptors for modelling the biological activities (pIC50) against 3D7 and W2 are represented in Additional file 1: Tables S7, S8, respectively. It should be mentioned that the results of test set were not considered during selection of the optimum model. The definition of the used molecular descriptors for modelling the biological activities (pIC50) for the 3D7 and W2 strain are presented in Table 2.

Table 2 The definition of the used molecular descriptors for modelling of two kinds of activities (3D7 and W2)

After the selection of the descriptors, the evaluation of correlation was done using the pair-correlation matrix for 23 training compounds and the total of training and test sets (28 compounds). The related data are shown in Tables 3 and 4 for six descriptors of the 3D7 and W2 models respectively.

Table 3 The pair correlation coefficient (R2) and the variance inflation factor (VIF) for the 6 descriptors at the training and total set for 3D7 model
Table 4 The pair correlation coefficient (R2) and the variance inflation factor (VIF) for the 6 descriptors at the training and total set for W2 model

Model development

At the model development and validation steps, the training set with 23 compounds (70% of the imidazolopiperazine derivative molecules) was used for artificial neural networks modelling. Feed forward artificial neural networks with Levenberg–Marquardt algorithm were used for this purpose. The validation and test sets (each of them with 5 compounds containing 15% of the imidazolopiperazine derivative molecules) were used to validate the prediction ability of the proposed anti-malarial models. The statistical parameters of the used ANN models is represented in Table 5.

Table 5 Statistical parameters of the artificial neural networks models used for prediction of anti-malarial activity at 3D7 and W2

The actual and predicted amounts of pIC50 of the used imidazolopiperazine derivatives as anti-malarial structures against 3D7 and W2 strain are represented in Table 6. There is good agreement between predicted and actual values of pIC50 in the proposed anti-malarial model for 3D7 activity and can be seen visually in Fig. 2. About model constructed for activity against W2 however the training and validation was acceptable but the prediction ability in the external test set was not as good as 3D7.

Table 6 Experimental and predicted activities (pIC50) of the imidazolopiperazine derivatives as anti-malarial structures against 3D7 and W2
Fig. 2
figure 2

Plot of the pIC50 predicted (using artificial neural networks model) versus the experimental pIC50 for a 3D7 and b W2

Despite the good agreement between actual and predicted values in the two 3D7 and W2 models and specifically in the first one, but because the high number of descriptors (about 555 descriptors) which were selected in the variable selection step, there was the possibility of obtaining chancy models [22]. For evaluation of chance correlation, y-scrambling test was done. The dependent variable of the two 3D7 and W2 models (the experimental pIC50 of the selected derivatives) was randomly shuffled 30 times and ANN was run on them each time. The maximum correlation coefficient of the test set (R2MP) for these scrambled 3D7 and W2 models were 0.09 and 0.13 respectively. These low values of the correlation coefficients of the scrambled models (R2MP) in comparison to the original 3D7 and W2 models imply the absence of chance correlation.

Applicability domain

Applicability domain (AD) of a QSAR model is an important point, because it defines the model limitations. Actually “the applicability domain of a (Q)SAR model is the response and chemical structure space in which the model makes predictions with a given reliability” [23].

Different methods have been suggested for calculation of AD [24]. One of the recommended approaches to define AD is the method based on leverage and standard residual. The Williams plot that displays the standardized residuals versus leverage (hat diagonal) values is a way to verify the AD of a QSAR model [25, 26]. Leverage is proportional to the Mahalanobis distance of a query chemical from the centroid of the training set. For a given descriptor dataset X, the leverages are calculated with the (H = X (XʹX)−1Xʹ) equation, where Xʹ is the transpose of X matrix [24, 27]. The diagonal value (hi) represents the leverage value for ith point in the X dataset from the centre of the set of X observations. The higher leverage values represent the far compounds from the centre and they are more influential in model building [24]. It should be mentioned that a warning value for leverage is defined; so that if a query chemical has higher leverage than the warning value, it can be as unreliable prediction [24]. This warning leverage generally is equal to 3p/n where p is the number of model descriptors plus one (here p = 7), and n is the number of compounds used for the training model [24, 28]. It should be noted if the leverage of a query chemical was less than the warning value, there is not necessarily to be stayed on the range of the applicability domain of the model, and may be it has high standardized residuals. So in the Williams plot both of the two parameters (leverage values and standardized residuals) for surveying the AD of model has been considered. The Williams plots of 33 compounds of the models with 6 descriptors for 3D7 and W2 are displayed in Fig. 3.

Fig. 3
figure 3

The Williams plots of the 33 compounds obtained by 6 descriptors used in artificial neural network models for a 3D7 and b W2

Discussion

In this research an artificial neural network was employed to gain a set of descriptors and to build a QSAR model for antimalarial activity. The controversial topic is how each step for QSAR model building such as data collection, model validation and prediction is performed.

The randomized selection of prediction and test subsets is a good method for external evaluation of the final model [29]. For each 3D7 and W2 model, As seen in Fig. 1, the two-dimensional PCA plot (PC1 vs. PC2) show that the molecules belongs to the three training, validation and test data sets are randomly distributed in the space of descriptors.

The other important step in any QSAR study is variable selection, because the method which is used for descriptor choosing has a great impact on all subsequent steps in drug design. The ideal path for variable selection is extensively search to all possible combinations of the initial descriptors, which is impossible except with small data set which have small number of descriptors [30].

For this purpose after using stepwise MLR (SMLR), the variables with the most significant Fisherʼs value (F) and with the highest correlation coefficient (R2) were selected. In this way variables with the most significant Fisherʼs value and the highest correlation coefficient were selected by inserting into/removing from the model. The number of 12 variables were checked using ANN method for training and validation sets. At the end it was found that the prediction ability of the models with 6 variables (reported in Table 2) are the best.

The next step was the evaluation of correlation in the selected descriptors. In the case of correlation between descriptors, the efficiency of the QSAR models are reduced and leads to biased estimation [31]. The pair correlation matrix was evaluated for the six descriptors of the two 3D7 and W2 models (Tables 3 and 4).It is clear from the Tables 3 and 4, no serious dependency is found in both descriptor set.

In addition to pair correlation, another kind of linear dependency can limit the accuracy of model which is known as multicollinearity which is shown the linear dependency of a variable (predictor) to all others [31]. Variance inflation factor (VIF), which is given in the following equation, is a popular diagnostic index for appearing multicollinearity [32].

$${\text{VIF}}_{\text{i}} = \frac{1}{{1 - R_{i}^{2} }}$$

where \({\text{R}}_{\text{i}}^{2}\) is the R2-value obtained by regressing the ith predictor on the other predictors [32].

As it is shown in the last column of Tables 3 and 4, all the calculated VIF are less than 3, and as regards to the proposed critical value for VIF that is equal to 5.0, the information of none of the six used descriptors for both 3D7 and W2 models has multicollinearity with the other descriptors and the resulting models are acceptable.

Looking at the results of model development and validation (Table 5), we find that the values of R2 and MSE of the training set for both of 3D7 and W2 models are good and express the good fitness of the models. Nevertheless it is necessary to use validation and test set to check the prediction ability of the anti-malarial models. As can be seen in Table 5, the statistics of validation and test sets of the model suggested for inhibitory against 3D7 strain were excellent (R2val = 0.959, R2test = 0.920) and the values of MSE for the two validation and test data sets were also good (0.051 and 0.254, respectively). The model generated for inhibitory against W2 strain with R2val = 0.797 and R2test = 0.813 was acceptable and the MSE values of its validation and test data sets (0.290 and 0.740, respectively) were not good in case of test set. It is clear from the results that the anti-3D7 activity model with the excellent statistics values for training, validation, and test (R2train = 0.947, R2val = 0.959, R2test = 0.920) was better from the anti-W2 activity model and the latter was not very good but has acceptable performance which can be used for a brief estimation of activity against W2.

Also from the Williams plots (Fig. 3), it is clear that all the 33 compounds, except molecule No. 7 in W2 model have leverage values lower than the warning leverage. Also all the compounds were in the acceptable range of standardized residual (± 3σ). These results confirm that the prediction using six descriptors (which were selected by SMLR) in ANN models can be acceptable.

Also from the Williams plots (Fig. 3), it is clear that all the 33 compounds, except molecule No. 7 in W2 model have leverage values lower than the warning leverage. Also all the compounds were in the acceptable range of standardized residual (± 3σ). These results confirm that the prediction using six descriptors (which were selected by SMLR) in ANN models can be acceptable.

Conclusion

Malaria is a deadly infectious disease, which is prevalent especially in the tropical developing countries. Resistance to existing anti-malarial drugs is a factor forcing researchers to develop or modify the anti-malarial compounds. The QSAR study with highlighting the structure activity relationships which correlate the compounds’ structural features with the observed anti-malarial activities could be a suitable way to design and to modify anti-malarial compounds. Actually in silico drug design methods, such as QSAR, play an important role in the drug design process due to saving money and time.

In this research, the anti-malarial activity of 33 imidazolopiperazine derivatives was investigated at 3D7 and W2 strain, using QSAR method. The linear methods, such as MLR and PLS models was not suitable but non-linear ANN showed good performance. The statistical parameters were used to evaluate the results. The results of R2, MSE and leverage value showed that the prediction ability of ANN method for estimation of the anti-malarial activity in imidazolopiperazine compounds is good and can be used as a virtual tool for synthesis of analogous compounds.