In-silico Design of Aryl and Aralkyl Amine-Based Triazolopyrimidine Derivatives with Enhanced Activity Against Resistant Plasmodium falciparum

A blend of genetic algorithm with multiple linear regression (GA-MLR) method was utilized in generating a quantitative structure–activity relationship (QSAR) model on the antimalarial activity of aryl and aralkyl amine-based triazolopyrimidine derivatives. The structures of derivatives were optimized using density functional theory (DFT) DFT/B3LYP/6–31 + G* basis set to generate their molecular descriptors, where two (2) predictive models were developed with the aid of these descriptors. The model with an excellent statistical parameters; high coefficient of determination (R2) = 0.8884, cross-validated R2 (Q2cv) = 0.8317 and highest external validated R2 (R2pred) = 0.7019 was selected as the best model. The model generated was validated through internal (leave-one-out (LOO) cross-validation), external test set, and Y-randomization test. These parameters are indicators of robustness, excellent prediction, and validity of the selected model. The most relevant descriptor to the antimalarial activity in the model was found to be GATS6p (Geary autocorrelation—lag 6/weighted by polarizabilities), in the model due to its highest mean effect. The descriptor (GATS6p) was significant in the in-silico design of sixteen (16) derivatives of aryl and aralkyl amine-based triazolopyrimidine adopting compound DSM191 with the highest activity (pEC50 = 7.1805) as the design template. The design compound D8 was found to be the most active compound due to its superior hypothetical activity (pEC50 = 8.9545).


Introduction
Despite decades of efforts to curb and eradicate the malaria pandemic, it still retains its status as one of the most dangerous and fatal infections in the tropical and subtropical endemic countries, having over 500 million reported cases with approximately one million deaths yearly [1]. The disease is caused by Plasmodium falciparum (P. falciparum), the most lethal of all the five Plasmodium species, others are; Plasmodium malariae (P. malariae), Plasmodium knowlesi (P. knowlesi), Plasmodium falciparum (P. falciparum), Plasmodium vivax (P. vivax), and Plasmodium ovale (P. ovale) [2][3][4]. The protozoan parasite Plasmodium falciparum produced in the red blood cell (RBCs) requires various hosts for its growth and survival. The host is unable to manufacture purine rings de novo, hence purine is reclaimed from the host by the parasite [5]. Within the digestive vacuole of the parasite, the hemoglobin releases heme molecules that are toxic to the parasite. The heme is crystallized into hemozoin to nullify its toxicity. A large amount of hemoglobin is accumulated in the digestive vacuole [6]. The rate of heme release must equal to that of hemozoin crystallization for the survival of the parasite otherwise toxic heme buildup in the parasite [7].
Several compounds including chloroquine, artemisinin, amodiaquine, and quinine have been reported to have antimalarial activity and are employed in the treatment of malaria [8]. The treatment of malaria is faced with great challenges which include the rapid resistance developed by Plasmodium parasites to the available antimalarial drugs as well as the lack of an effective antimalarial vaccine. This resistance ensures that the available antimalarial drugs become ineffective in the treatment of pandemic disease. Hence, the need to develop new and novel antimalarial agents through chemical modification of existing compounds becomes a necessity [9].
Triazolopyrimidines identified with either lack of potency or poor plasma exposure due to metabolic stability were found to have improved potency as a result of the substitution of arylamine moiety [10]. In light of this, aryl and aralkyl amine-based triazolopyrimidine compounds reported for their antimalarial activity against the multi-drug resistant Plasmodium falciparum strain, 3D7 [10] could provide an alternative application to the orthodox antimalarial drugs.
Quantitative structure-activity relationship (QSAR) is an essential tool in the drug design field by reducing the time as well as the cost of drug development [11]. QSAR relates the physicochemical properties of congeneric compounds with the structural features responsible for their activity [12].
Several QSAR studies related to the design of antimalarial drugs were reported. In 2018, Hadanu and co. investigated the antimalarial activity of nitrobenzothiazole derivatives [13]. Beheshti and Co study the modeling of antimalarial activity of urea derivatives [14]. Prediction of anti-malarial activity in a set of new imidazolopiperazines using artificial neural networks was reported [9], but no systematic QSAR studies were conducted on a series of aryl and aralkyl amine-based triazolopyrimidine compounds as an antimalarial. This research is aimed at establishing a virgin QSAR regression model for predicting the antimalarial activity of 32 derivatives of aryl and aralkyl amine-based triazolopyrimidine compounds through the application of a hybrid of GA-MLR approach. Thereafter, we performed in silico design of a template (a compound with the highest activity), into new derivatives with enhanced activity against P. falciparum with the aid of the most contributive molecular descriptor determined by the descriptors mean effect, which saves time and cost enquired in the drug development analysis.

Sources of Data
The dataset of thirty-two aryl and aralkyl amine-based triazolopyrimidine derivatives consisting of their structures and activities (Table 1) were extracted from the literature [15]. Their biological activities were expressed as pEC 50 (-Log 10 drug concentration to cause a 50% reduction in cell proliferation) suitable for QSAR analysis (Tables 2 and 3).

Geometry Optimization
The structures of the compounds were drawn with the help of ChemDraw software, while Spartan 14 software (Spartan 14v114) was utilized in the optimization of the molecular geometries [16] based on the density functional theory (DFT) level using Becke's three parameters Lee-Yang-Parr hybrid functional (B3LYP) in combination with the 6-31 + G* basis set. The optimized Spartan 14 structures are saved in SDF format.

Descriptors Calculation
The lowest energy-optimized geometries were imported into the PaDel-Descriptor version 2.20 for descriptors calculation. About 1500 varieties of molecular descriptors were calculated for each molecular geometry present in Table 1

Dataset Division
The dataset was split in the ratio 78:22 with 25 training compounds (78% of the data set) and 7 test compounds (covering the remaining 22%) with the aid of the DatasetDivision1.2 program [18] by employing the Kennard-Stone's algorithm method.

QSAR Model Development
The training set was utilized in the development of the QSAR model, where the activities (dependent variable) and the molecular descriptors (independent variables) were exposed to model development by the Genetic Function Approximation (GFA) method contained in the material studio software. The Genetic algorithms (GAs), a copycat of biological evolution, are currently the ultimately used variable selection method that addresses both the constrained and unconstrained optimization challenges based on an essential selection process [19,20]. GA is a prying search method that looks for the correct or approximate solutions to any optimization challenges [21]. Its ability to produce more than one blend of descriptors useable in model construction is an added advantage. With GAs, users have the liberty of imposing the equation length and employs a lack of fit (LOF) function to check over-fitting in the models produced. The lack-of-fit (LOF) was calculated in the material studio to estimate the model fitness using Eq. 1.
where SSE stands for the sum of squares of errors, c, number of terms other than the constant in the model, d stands for the user-defined smoothing parameter, p is the number of descriptors in the model, and M, represent the samples counts in the training set.

Multi-co-linearity analysis
The possibility of a huge degree of correlation among the model descriptors was evaluated using a variance inflation factor (VIF). For a descriptor, i, the VIF i was calculated from Eq. 2. (2) where R 2 ij represents the multiple regression's correlation coefficients between the descriptor i and the remaining j model descriptors.

Internal Validation
The leave-one-out (LOO) cross-validation technique was deployed to internally validate the model. It involves the elimination of one compound from the data set at random in each cycle and the model is built with the remaining compounds. The constructed model is used for predicting the activity of  where Yobs is the observed activity of the training set, Ypred represents the predicted activity of the training set and Ŷ is the mean observed activity of the training set.

External Validation
The developed model was externally validated to determine its predictive strength, through predicting the activity of some dataset (test set) separated from the training set. This is achieved using Eq. 4 below.
where Y obs (test) and Y (test) indicates the test set predicted and observed activity values respectively, Ŷ training , indicates training set means activity. R 2 pred is the predicted coefficient of correlation estimated from the predicted test set activity.

Y-Randomization
The possibility of the model occurring by chance was determined through the Y-randomization test, which also ascertains the robustness of the developed model. The Y-Randomization test involves constructing a model with a randomizing the activities of the training set. A detour in the square of the mean correlation coefficient of the randomized activity model (R 2 r) from that of the non-randomized activity model (R 2 ) is reflected in the value of R 2 m parameter calculated with help of the Eq. 5 [22].
Normally, the average value of R 2 r for the randomized activity model is expected to be zero, implying equal values of R 2 m and R 2 r for the developed model, necessitate the use of cRp 2 in place of R 2 m. A model is deemed not due to a chance when the coefficient of determination (cRp 2 ) is found to be greater than 0.5 [23]. The value of cRp 2 was calculated for each randomized activity model with the aid of Eq. 6 given below.

Model Applicability Domain
The applicability domain (AD) provides an assessment of the model in terms of its ability to predict the activity of unknown compounds excellently in form of chemical space. Every model is only able to predict the activity of a certain number of compounds; such prediction is conditioned on the satisfaction of the assumptions applied in the construction of the model [24]. The chemical space of the AD is designed by plotting the values of leverage against the standardized residuals (Williams plot) of the entire data set. The leverages are obtained from the diagonals of a hat matrix obtainable from Eq. 7 [17].
where, H i , is the hat matrix of the training/test set, X i is the original matrix of training/test set, and X T i is the transpose matrix of the training/test set.
Being a prediction tool, the AD has warning leverage (h*) defining the limit of normal values for X outliers and it's defined as h * = 3(z + 1)∕n , where n is the number of the training set, and z is the number of descriptors in the model. Only compounds with leverages less than or equal to the warning (h*) are considered to be within the chemical space and their activities are reliably predicted by the model. In addition to the AD, a plot of the normalized mean distance of the model descriptors against the standardized residuals of the predicted activities of the data set (UZAIRU'S plot) was employed to detect outliers present in the data set [25].

Mean Effect (MF)
The model descriptor's relevance was analyzed by evaluating the descriptors mean effect (MF). The contribution by individual descriptor to the model are reflected by the size as well as the signs of their mean effect. The most contributive descriptor has the largest mean effect, while the small mean effect size is for the least contributive descriptor. The mean effect is evaluated from Eq. 7.
where j is the coefficient of j, D j is the value of each training set in the descriptor matrix and m is the tally of model descriptors and n is the tally of molecules used as the training set [26].

Molecular Design
The relevance of the model's descriptors cannot be overemphasized as such the interpretations of the most contributive descriptor could provide a guide to design several hypothetical derivatives of aryl and aralkyl amine-based triazolopyrimidine having enhanced antimalarial activities. This is made possible by selecting the compound with the highest activity as a design template. The structural feature of the most informative descriptor forms the basis of manipulating the design template by substituting group(s) based on the information from the size of the descriptors mean effect, to design some theoretical compound possessing higher antimalarial activity than the design template through replacing atom/group of atoms at different positions on the template. The design derivatives were then optimized and their descriptor calculated from which the antimalarial activity was obtained.

QSAR Studies
Two QSAR models developed through the genetic function algorithm analysis of the datasets were as presented in Table 4:

QSAR Model Selection
The second tetra parametric equation (model II) was announced as the best QSAR model based on its statistical data obtained from multiple linear regression (MLR) of the material studio. The model was used in evaluating the theoretical activity of the aryl and aralkyl amine-based triazolopyrimidine derivatives used in the analysis. The model may not have the highest R 2 (0.8884) compared to that of the model I (0.8931), but its higher square coefficient of correlation for the test set (R 2 ext = 0.7019) which is in line with the prerequisite for a dependable, foretelling and robust QSAR model [27] ensures model II emerges as the best model. The definition of the descriptors in the model as well as their class were reported in Table 5.
The meager residual values of the data set (Tables 1, 2, and 3) calculated by the best model, is an indication of a good agreement between the observe and predicted activity as well as the reliability of the model, this was supported by the distribution of the data set around the legend, displayed in the plot of experimental pEC 50 vs predicted pEC 50 presented as Fig. 1.

Model Validation
A significant QSAR model is one that can predict the activities of an external data set, (R 2 ext ) not just in its ability to clone the training set (Q 2 LOO ). Hence, the training set prediction was affirmed by the leave-one-out (LOO) cross-validation method. The large Q 2 LOO value, 0.8317 calculated is a good indication of an excellent internal validation. The results of the external validation, R 2 ext = 0.7019, show the predictive strength of the model and hence its robustness. Analysis of the correlation matrix shown in Table 6, shows a large value of correlation coefficients between the descriptors and the activity, an indication of a great interrelation between the two. Whereas the low correlation coefficients between the descriptors themselves indicate noncollinearity between the descriptors. The descriptor's multicollinearity is explained with variation inflation factors (VIF). The VIF values of the model's descriptors as reflected in Table 6, all fall within, 1 < VIF ≤ 5 range, which is a VIF criterion for Moran autocorrelation-lag 6/weighted by first ionization potential 2D 2 GATS5e Geary autocorrelation-lag 5/weighted by Sanderson electronegativities 2D 3 GATS6p Geary autocorrelation-lag 6/weighted by polarizabilities 2D 4 RDF30m Radial distribution function-030/weighted by relative mass 3D model acceptance [28,29]. The results of the VIF confers orthogonality and acceptability on the model. The outcome of the Y-randomization inquiry after scrambling the activity ten folds shows that the R 2 and Q 2 values for the scrambling activity model are significantly low, confirming the robustness of the model. Table 7, shows the Y-randomization parameters with cR 2 p greater than 0.5 reaffirming the model's quality. The standardized residuals plotted against the calculated leverage values (Williams plot) gives a graphical identification of both the outliers and the influential substances within a model. The applicability domain (Fig. 2), is entrenched within ± 3 standardized residual range and a leverage threshold (h* = 0.6). Figure 2, clearly shows no compound of the data set beyond the standard deviation, ± 3d, and a leverage value, h > 0.6. Hence, the entire data set contain neither an outlier nor an influential substance. Furthermore, the Uzairu's plot (the normalized mean distance of the model descriptors versus the standardized residuals of the predicted activities of the data set) presented in Fig. 3, also confirms the absence of an outlier.

Interpretation of Descriptors
In this research, the four molecular descriptors chosen by GA play a significant role in the discovery of new drugs for the treatment of malaria. The contributions of each descriptor to the model were carried out in the MF analysis displaced in Fig. 4. The first two descriptors are MATS6i and GATS5e, which are both 2D-autocorrelated descriptors obtained from a molecular graph through the addition of the atomic weight of the terminal atoms of the

Experimental (pEC 50 )
Training Set Test Set Linear (Training Set) Fig. 1 The predicted pEC 50 against the experimental values for the data set considered path length. While MATS6i descriptor encodes the ionization potential of the substituted aryl amine-based triazolopyrimidine analogs, it expresses the tendency of an electron cloud or molecule to interact with a target [30], from the MF analysis, the descriptor has the least contribution (MF = 0.00298) to the antimalarial activity and its positive sign is an indication of a direct relationship with the activity. The GATS5e relates to the electronegativity of the compound which has an inverse relation with the activity as a result of the MF sign.

Standardised Residuals
Training Set Test set The third descriptor, GATS6p is also a 2D descriptor [31] that reflects the topological structure of the molecules. The difference in the atomic polarizabilities of the molecule is explained by the descriptor [32]. This descriptor is the most relevant with MF = 1.66902, and its positive MF sign indicates increasing the value of the descriptor, increases the antimalarial activity.
The last descriptor, RDF30m indicates the existence of some linear dependence between the inhibitory activities of aryl amine-based triazolopyrimidine derivatives and the 3-dimensional molecular distribution of mass determined at a radius of 4.0 Å from the centers of all molecule. The descriptor's negative MF is a reflection of its indirect relation with the activity. A decrease in the value of the descriptor increases the antimalarial activity.

Molecular Design
The compound with the highest antimalarial activity, DSM191 (pEC 50 = 7.1805) was exploited as the design template. The most contributive descriptor, GATS6p (MF = 1.6690) influences the antimalarial activity of the derivatives of aryl and aralkyl amine-based triazolopyrimidine greatly, and relate to the atomic polarizabilities of the molecule. With the positive mean effect, antimalarial activity increases with an increase in the descriptor value (atomic polarizabilities) of the molecule. Atomic polarizability increases with an increase in the number of electrons (leading to a reduction in the effective nuclear charge). Therefore, various electron-donating groups such as; -CH 3 , -CH 2 CH 3 , -COCH 3 , -OCH 3 , -OH, -NH 2 , and -N(CH 3 ) 2 were added to the design template at ortho/para positions to release electron density into the ring thereby creating derivatives of the template which increase the polarizability and by extension, increase the antimalarial activity of the designed derivatives. Sixteen (16) derivatives of the template were designed, out of which nine of the designed compounds (2D, 4D, 8D, 11D, 12D, 13D, 14D, 15D, and 16D) have improved hypothetical antimalarial activity than the design template as well as Chloroquine (pEC 50 = 4.6352) and Pyrimethamine (pEC 50 = 6.2593) standards as shown in Table 8. Compound D8, N 7 -(benzo[b]thiophen-5-yl)-N 6 ,N 6 ,5-trimethyl- [1,2,4] triazolo[1,5-a]pyrimidine-6,7-diamine was identified with the highest antimalarial activity (pEC 50 = 8.9545) as reflected (Table 8), hence is the most active designed derivative (Fig. 5).

Conclusions
This research is focused on designing antimalarial derivatives through the developed QSAR model used for predicting the activities of aryl and aralkyl amine-based triazolopyrimidine derivatives. The PaDel-Descriptor version 2.20 software was employed to calculate several descriptors used for the regression analysis carried conducted by the GA-MLR process and the developed model validated comprehensively. The selected model revealed descriptors; MATS6i, GATS5e, GATS6p, and RDF30m played a significant role in the antimalarial activity of the compounds. The mean effect analysis indicated GATS6p as the most contributive descriptor (MF = 1.6690) which estimates the polarizability of the molecule whose positive values increasing the activity of the compounds. Sixteen compounds were designed using DSM191 (a compound with the highest antimalarial activity) as the design template and various electron-donating groups used to increase the atomic polarizability were directed at the ortho/para positions of the aromatic rings of the template. Compound D8, N 7 -(benzo[b]thiophen-5-yl)-N 6 ,N 6 ,5-trimethyl- [1,2,4] triazolo[1,5-a]pyrimidine-6,7-diamine, was found to have the highest hypothetical activity (pEC 50 = 8.9545) than the template (pEC 50 = 7.1805) and the fellow designed compounds. These designed compounds when synthesized could be clinically validated for antimalarial treatment. Hence, the studies will reduce trial and error as well as minimize the time and cost involved in drug design. Furthermore, in the future, we could conduct the molecular docking studies of the design compounds against their target to observe the nature of interactions responsible for the antimalarial activity. Consent for Publication On behave of the authors, I hereby granted the right of this entire article content to this journal.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.