Background

Malaria is a deadly infectious disease with about 228 million infected cases and 405,000 deaths worldwide, as recorded in 2018 [1]. The disease is caused by the bite of a mosquito having the Plasmodium parasite, which consists of five main species, Plasmodium falciparum, Plasmodium vivax, Plasmodium ovale, Plasmodium knowlesi and Plasmodium malariae [2]. Of these species, 90% of deaths (mostly in children) were related to P. falciparum [3]. Anti-malarial drugs, such as quinine, chloroquine, artemisinin, proguanil, pyrimethamine, mefloquine, and atovaquone, have been indicated as malaria treatment [4,5,6]. However, Plasmodium species developed resistance to most of these commonly used drugs. This resistance and the lack of a vaccine has become a major problem in malarial treatment in recent years [7]. Therefore, there is a pressing need to improve the efficiency by modifying existing compounds to face drug-resistance, as well as to discover novel anti-malarial compounds.

Due to funding investment constraints, in silico and collaborative approaches have become particularly attractive approaches for malaria drug discovery efforts. Some in silico techniques, namely molecular docking, pharmacophore models or quantitative structure–activity relationships (QSARs) significantly reduce the time and cost in the drug discovery process. Among the techniques, QSAR is considered a valuable tool that is applied extensively in rational drug design. The predictive QSAR model provides a mathematical correlation between the structural properties of the compounds and their anti-malarial activities using one-, two-, and three-dimensional descriptors of physicochemical properties, as well as structural characteristics relating to the activity. Once a reliable QSAR model has been developed, the biological activities of molecules can be predicted from the molecular descriptors by different methodologies, such as multiple linear regression (MLR), partial least squares (PLS), artificial neural networks (ANN) and heuristic method (HM). In recent years, QSAR models were applied to a variety of anti-malarial compounds to figure out physicochemical and structural characteristics that are essential for their activity [8,9,10,11,12,13]. Some QSAR models developed using sulfonamide and its derivatives, 5-(2-methylbenzimidazol-1-yl)-N-alklythiophene-2-carboxamid derivatives in order to select models that had the best predicting ability [6, 14]. Other studies used three-dimensional QSAR (3D-QSAR) combining with extra analysis gave striking structural characteristics that related to anti-malarial efficacy and the mechanism of action of anti-malarial compounds [15,16,17].

The anti-malarial activities of various groups of compounds, in particular quinine and its derivatives, had a satisfactory correlation with their anti-haemozoin activity [18]. Haemozoin is formed inside the food vacuoles of parasites to prevent lethal toxicity of haem, which is a product of the catabolism of haemoglobin. Thus, anti-haemozoin is an important therapeutic target in anti-malarial treatment. Recently, different approaches were highlighted. These approaches include the high-throughput screening (HTS) of anti-malarial drugs based on their physicochemical properties of haemozoin formation, or building computational models for in silico to screen novel anti-malarial drugs, or analog development from natural compounds or existing agents [19]. Of which, prediction models of correlation between anti-haemozoin and anti-malarial activities strongly assist in anti-malarial drug discovery, from modifying known compounds to identifying new chemical scaffolds for different targets of a large diverse database of compounds [18]. However, there is no QSAR model for anti-malarial activity of anti-haemozoin inhibitors. The aim of this study was to develop quantitative structure–activity relationship models to determine the influences of physiochemical structures of haemozoin inhibitors on anti-malarial activities.

Methods

The best QSAR model will be chosen and could be applied for screening and designing better anti-haemozoin compounds for anti-malarial activities in next studies. QSAR modelling was conducted for anti-malarial activities of haemozoin inhibitors using the multiple linear regression (MLR) and partial least square (PLS) methods. Database of 21 compounds possessing both anti-malarial and anti-haemozoin activities were used for building QSAR models. The IC50 of these compounds varied from 0.06 – 10.5 µM (or pIC50 ranged between -1.02 to 1.22). The QSAR model was chosen based on the predicted fitness plots and statistical values of the models. Evaluation of QSAR models depended on three data sets, the training, validation and test sets. The results included the corresponding descriptors (coefficients) and correlation of the observed—predicted values of anti-malarial activities and the statistical parameters. The parameters, correlation coefficient or coefficient of determination (R2 or r-squared), cross-validated r2 (or Q2) and r2 for the external test set (R2_pred), and root mean square error (RMSE) as fitting criteria, were employed to evaluate the goodness of the models. The predictive model was tested based on different methods, such as internal for training set and external validation for test set, as well as Y-randomization method.

Data set

To perform 2D-QSAR, a complete data set containing 21 anti-haemozoin compounds (Table 1) was taken from the experimental anti-malarial activities identifed in a previous work [20]. The half maximal inhibitory concentration (IC50) of the anti-haemozoin compounds was converted to logarithmic scale (pIC50) and used as the dependent variable. These compounds were randomly divided into two subsets, a training set (16 compounds) and a test set (6 compounds).

Table 1 Structures and their anti-malarial activities (IC50 values) of 21 anti-haemozoin compounds in building 2D-QSAR model

2D-QSAR

A flowchart for developing 2D-QSAR was conducted following eight steps (Fig. 1). Initially, database included 21 compounds having anti-Plasmodium 3D7A activity. The IC50 values of these compounds were converted into logarithm scale logIC50 or pIC50 (pIC50 =− logIC50). The process of energy minimization of the compounds was performed using MOE 2015.10. A further step was the calculation of 2D descriptors. A total of 206 descriptors described molecular structures, including geometrical, physicochemical, sterical and lipophilic, which were calculated using Descriptors tool in MOE 2015.10. The database was subsequently divided into two subsets, a training set and a test set, with a 75:25 ratio. The database was divided randomly using RAND or Diverse subset using MOE. Selection of descriptors was carried out carefully. Some descriptors were removed based on three methods, firstly, if more than 15% compounds had descriptor values of 0 using Microsoft Excel. Secondly, using Rapidminer Studio 8.2.0 to take out of descriptors of the compounds which possess 50% similarity. Thirdly, remove randomly one of two descriptors having a cross correlation value of more than 70% using Rapidminer. These selected descriptors were also separated according the ratios of between 0 to 1 using Normalize in Rapidminer Studio based on the Eq. 1 below.

$${\text{X}}_{{\text{n}}} = \frac{{{\text{X}}_{0} - {\text{Min}}_{0} }}{{{\text{Min}}_{0} - {\text{Min}}_{0} }}$$
(1)

of which: \({\text{X}}_{{\text{n}}}\): Value; \({\text{X}}_{0}\): Initial value; \({\text{Min}}_{0}\),\({\text{ Max}}_{0}\): Minimum, maximum of initial values.

Fig. 1
figure 1

General steps of generating the QSAR model

Contigency tool in MOE and BestFirst—a searching method with assessment algorithm CfsSubsetEval in Weka 3.8.1 were used to find out the suitable descriptors.

Some outliers were removed by using PCA or Z-score, if the compounds had Z-score values of more than 2.0 before building 2D-QSAR. Using MOE with Model tool, 2D-QSAR models were developed using MLR. The best models were selected based on the highest values of the square of the coefficient of determination (R2) value, internally cross-validated R2 (Q2), and the external validated R2 (R2_pred). Of which, external validation used the test set while the training set was for model development. The internal validation parameters that were used, represented models’ goodness-of-fit and robustness. Finaly, evaluation of 2D-QSAR model on two datasets, training set and test set: Internal and external validations were conducted. The internal validation used the leave-one-out (LOO) cross-validation to internally validate the QSAR model. This is done by excluding the point(s) of training set data, then constructing the model based on the remaining data activities and finally, using this model to test the excluded data. This process was repeated until the training set activities were predicted. The coefficient of cross-validated R2 (or Q2) was calculated for the training set. The external validation was using the model for prediction of the biological activities of test set. The value of predicted correlation coefficient (R2_pred) value was calculated for the test set.

Results and discussion

To conduct this study, database of 21 anti-haemozoin compounds was taken for building 2D-QSAR models (Table 1) to explore the structure–activity relationship of haemozoin inhibitors acting as anti-malarial agents. These compounds had in vitro anti-malarial activities against P. falciparum 3D7A and were used for QSAR modelling. The data set was randomly split into a training set (15 compounds) for model construction and test set (6 compounds), for validation of the model, respectively. The quality of a built QSAR model was demonstrated by the fitting and its predicting ability.

Variable selection

Five two-dimensional descriptors of the training set were selected for QSAR modelling as they all had low inter-correlation (Table 2). They included atom count (a_ICM); + adjacency and distance matrix descriptor (GCUT_SLOGP_2: the third GCUT descriptor using atomic contribution to logP (using the Wildman and Crippen SlogP method) instead of partial charge); average total charge sum (h_pavgQ) in pKa prediction (pH = 7); a very low negative partial charge, including aromatic carbons which have a heteroatom-substitution in “ortho” position (PEOE_VSA-0) and molecular descriptor (rsynth: estimating the synthesizability of molecules as the fraction of heavy atoms that can be traced back to starting material fragments resulting from retrosynthetic rules). The study demonstrated that the average total charge sum (h_pavgQ) in pKa prediction (pH = 7) was the most important descriptor with the correlation coefficient values of about 0.41 (Table 2).

Table 2 Correlation matrix for inter-correlation of five selected descriptors and their correlation with the bioactivity (pIC50) against P. falciparum 3D7A with the Pearson’s correlation coefficient values

QSAR model development

After selecting molecular descriptors, the linear QSAR models were built using the training set data. The outliers were checked and removed based on their values of PCA (principal component analysis), Z-score, and ZX-score of more than 2. There are four QSAR models that were developed based on the selection of different methods, namely PLS (Partial least squares) and PCR (Principal component regression), respectively with or without outliers (Table 3).

Table 3 Evaluation results of 2D-QSAR models generated by PLS (Partial least squares) and PCR (Principal component regression) methods

Validation of QSAR models

The evaluation of the QSAR models included the internal and external validations. The parameters for internal validation were R2 (a correlation coefficient), Q2 (predictive ability of the built QSAR models in the training set data employing leave-one-out (LOO) cross-validation method), and R2_pred (predictive ability for the test set). QSAR model is selected if it complies with the three criteria: the values of the high correlation coefficient (R2) between the experimental and the predicted values, the predictive ability of the model for the training set Q2 > 0.5, and the low standard deviation (RMSE). The comparison of four generated 2D-QSAR models were evaluated and compared in Table 3. The results showed that the QSAR models gave similar evaluation results by using PLS or PCR methods with outliers. This means that using different methods for the whole training dataset did not affect the development of the QSAR models. However, after removing outliers, the PLS model gave the better results, and the PCR model without outliers was worst than the others (Table 3). Therefore, the best QSAR model was the PLS model without outliers. The regression equation is represented as following:

$${\rm{pI}}{{\rm{C}}_{50}} = {\rm{ }} - 4.90988{\rm{ }} + {\rm{ }}1.98542{\rm{ }} \times {\rm{ a}}\_{\rm{ICM }} + {\rm{ }}0.74756{\rm{ }} \times {\rm{ GCUT}}\_{\rm{SLOGP}}\_2{\rm{ }} + {\rm{ }}0.59815{\rm{ }} \times {\rm{ h}}\_{\rm{pavgQ}}~ + {\rm{ }}0.00837{\rm{ }} \times {\rm{ PEOE}}\_{\rm{VSA}} - 0{\rm{ }}{-}{\rm{ }}0.12277{\rm{ }} \times {\rm{ rsynth}}.$$

where: R2 = 0.745031, RMSE = 0.166261, Q2 = 0.316410, and R2_pred = 0.9554.

The high values of R2 = 0.745; low standard error (RMSE = 0.166) and the good predictive ability: R2_Pred = 0.9554 (for the test set) indicated suitability of the model for predicting the anti-malarial activities of other haemozoin inhibitors from the existing anti-haemozoin compounds (Table 3). The experimental or observed versus predicted amounts of pIC50 of haemozoin inhibitors as anti-malarial structures against 3D7A strain were presented in Table 4 and Fig. 1. As can be seen in the Table 4, the predicted values of pIC50 values were in good agreement with the values of experimental pIC50.

Table 4 The values of selected descriptors and observed/predicted activities (pIC50)

The linear graphical plot was depicted in Fig. 2. The graph illustrated the good overlap of the observed and predicted activities of the data set with the high of correlation coefficient of R2 = 0.9554 (Fig. 2). The predicted values of pIC50 varied between − 1.021 to 1.222 with the value ranges of the selected descriptors presented in Table 5. The decrease of these descriptors led a decrease of pIC50 values meaning the increase of IC50 which is a decrease of anti-malarial activities.

Fig. 2
figure 2

Plot of the correlation between the experimental pIC50 and the pIC50 predicted anti-malarial activities using partial least squares model

Table 5 Values ranges of selected descriptors in 2D-QSAR model

Interpretation of descriptors

It was clearly inferred that the average total charge sum (h_pavgQ) in pKa prediction (pH = 7) contributed the most to the values of pIC50, which could be used as one indicator for predicting anti-malarial activities of other anti-haemozoin agents. The higher average total charge sum (h_pavgQ) in pKa prediction (pH = 7) resulted in increasing values of pIC50, or decreasing of IC50, indicating better anti-malarial activities (Table 4). The positive sign of these descriptors indicated that the larger the value of pIC50, the lower IC50 of the compound. In addition, this feature was also taken for evaluation and prediction of anti-malarial activities for some anti-malarial drugs, such as quinine, pyrimethamine, halofantrine and mefloquine. It was found that the higher their calculated h_pavgQ values, the better anti-malarial activities.

Furthermore, the decrease of distance matrix descriptor (GCUT_SLOGP_2) or the third GCUT descriptor using atomic contribution to logP could lead better anti-malarial activity. The result was compatible with the previous study as this descriptor represents for lipophilicity and low lipophilicity, especially at pH 3, 4, and 5 were significantly related to better anti-malarial activity of anti-haemozoin molecules.

In addition, the positive sign of the PEOE_VSA-0 descriptor, a very low negative partial charge, including aromatic carbons which have a heteroatom-substitution in “ortho” position suggests that increasing in the PEOE_VSA-0 will decrease the inhibitory potency of anti-haemozoin compounds. The increase of atom count (a_ICM), topological descriptors or structural components of the molecules have an effect on the variation of anti-malarial inhibitory activity of the anti-haemozoin compounds. Moreover, molecular descriptor (rsynth: estimating the synthesizability of molecules as the fraction of heavy atoms that can be traced back to starting material fragments resulting from retrosynthetic rules) was the least contributive. In addition, the predicted pIC50 in Table 4 were much different with the experimental pIC50 values for the outliers, especially C1, C2, C12, C16. As a result, removing these outlier compounds from the training set for building QSAR model was essential.

The limitation of this study is the toxicity evaluation. In fact, there is no model predicting both the structure–activity and the structure–toxicity relationships, but they are separate models either predicting the structure–activity or the structure toxicity. Therefore, this QSAR model is not suitable for predicting the toxicity of the compounds. Another QSAR model for toxicity is required.

Conclusion

With the 15 anti-haemozoin compounds, the satistically satisfactory 2D-QSAR model using PLS method was generated after removing the outliers. Five two-dimensional descriptors of the training set were selected: atom count (a_ICM); adjacentcy and distance matrix descriptor (GCUT_SLOGP_2: the third GCUT descriptor using atomic contribution to logP; average total charge sum (h_pavgQ) in pKa prediction (pH = 7); a very low negative partial charge, including aromatic carbons which have a heteroatom-substitution in “ortho” position (PEOE_VSA-0) and molecular descriptor (rsynth: estimating the synthesizability of molecules as the fraction of heavy atoms that can be traced back to starting material fragments resulting from retrosynthetic rules), respectively. The interpretation of the developed model suggests that the anti-malarial activity of haemozoin inhibitors increases with molecules having higher average total charge sum (h_pavgQ) in pKa prediction (pH = 7). The QSAR model also highlights that the descriptor using atomic contribution to logP or the distance matrix descriptor (GCUT_SLOGP_2), and structural component of the molecules, including topological descriptors does make for better anti-malarial activity.