Introduction

New drug compounds active against pathogenic organisms or parasites are discovered through a rigorous, multi-stage process that demands substantial human and material resources. Drug discovery also takes a long time (up to 12–15 years), which makes it difficult to introduce new drugs against emerging diseases or resistant strains of existing ones [1, 2]. The process involves the identification of candidates, synthesis, characterization, validation, optimization, screening and assays for therapeutic efficacy. Since the introduction of Artificial Intelligence (AI), many processes have become easier and faster because the models used can handle very large volumes of data within a very short time [3]. The application of AI to drug discovery and development is therefore welcome, as it is expected to shorten the time to market for drug candidates found to be active against parasites and pathogenic organisms. A subfield of AI known as Machine Learning (ML) has been widely applied to Drug Discovery and Development (DDD) [4].

DDD pipelines are long, complex and depend on numerous factors. ML approaches provide a set of tools that can improve discovery and data-driven decision-making for well-specified questions with abundant, high-quality data [5]. The growth of High Throughput Screening (HTS) data has increased the importance of ML tools at virtually all phases of drug discovery. ML has the potential to speed up the process and reduce failure rates in DDD [5]. The patterns that ML algorithms learn from such data form the basis for building models that are applied to prioritize compounds for the subsequent phases. ML techniques can also assist in identifying false leads at an early stage and facilitate the understanding of structure–activity relationships (SARs) [6].

This paper presents a systematic review of the use and prospects of ML in predicting, classifying and clustering IC50 values of compounds active against P. falciparum. Fundamentally, ML is the practice of using classification, regression or clustering algorithms to describe data, learn from it and then make decisions or predictions about new datasets. Classification is the process of recognizing, understanding and grouping ideas and objects into preset categories or sub-populations [2]. Using pre-categorized training datasets, ML applies a variety of algorithms to classify future datasets, as shown in Fig. 6A. Classification algorithms are predictive calculations that assign data to preset categories by analyzing sets of training data [7]. Predictive computational models make it possible to understand the correlation between descriptors and biological properties (activities) and thus to screen large molecular datasets computationally, which can improve the hit rate and reduce the overall cost of drug discovery [8]. Because parasites constantly develop resistance to current antimalarial drugs, the discovery of new drug candidates is a major global health priority [9, 10]. Previous ML-based research on tropical diseases, including malaria, has shown effectiveness in drug discovery [11]. Several algorithms have also been employed in previous studies to classify the IC50 values of compounds against P. falciparum, including Decision Tree (DT), K-Nearest Neighbors (KNN), Artificial Neural Networks (ANN) and PLS Discriminant Analysis (PLS-DA). These approaches have shown statistically significant performance [12].

Malaria as a global challenge

Global burden of malaria

Malaria remains one of the most life-threatening diseases and is caused by blood-borne protozoan parasites of the genus Plasmodium. Five species of Plasmodium are known to infect humans across the globe [13, 14]. According to the World Health Organisation (WHO), P. falciparum is the deadliest and most common causative agent and also the most prevalent species in sub-Saharan Africa. Southeast Asia, the Western Pacific, the Eastern Mediterranean and Latin America are currently burdened by P. vivax [15]. Malaria due to P. ovale was previously considered benign but can lead to life-threatening symptoms. P. malariae malaria, like P. ovale malaria, is usually not severe in humans, while P. knowlesi is the most prevalent species in Southeast Asia. Symptoms of malaria vary from species to species; however, paroxysms, anemia and headaches are common to all cases of human malaria infection. P. falciparum infection results in respiratory distress, blockade of deep capillaries, cerebral malaria, neurological disorders and, if untreated, eventual death [16]. Malaria is transmitted through the saliva of female Anopheles mosquitoes. The transmission cascade is complex and involves both sexual and asexual stages in mosquitoes and humans [14].

The global trends in malaria incidence are somewhat dramatic, complex and highly unpredictable. Malaria is still endemic in 87 countries, with 29 countries accounting for about 95% of all recorded cases [15]. The five most endemic countries, namely Nigeria, the Democratic Republic of Congo (DRC), Uganda, Mozambique and Niger, accounted for about 51% of all cases globally [15]. The WHO estimated that 409,000 malaria deaths occurred globally in 2019; 67% of these were recorded in children under 5 years old, 95% in 31 countries and 23% in Nigeria alone [17]. Despite the global burden of malaria, it is estimated that over a billion cases and millions of deaths have been averted in the last 20 years. A total of 82% of cases and 94% of deaths were averted in sub-Saharan Africa alone. In most cases, pregnant women and children under 5 years are the worst hit [18, 19]. According to the WHO report, over 12 million pregnancies were exposed to malaria infection; Central Africa, West Africa as well as East and Southern Africa recorded 40, 39 and 24% prevalence of exposure to malaria, respectively, in 2019 [15]. This high prevalence of exposure results in low birth weight in most cases. To save the future, there is a need to protect children under 5 years and pregnant women from the menace of malaria.

Malaria prevention and control through investments in research

To prevent malaria and curb re-infection, several programs were designed in the past. One such program is the High Burden High Impact (HBHI) approach launched by the WHO in 2018 [15]. Though its launch and/or implementation was disrupted in some high-burden countries by the COVID-19 pandemic, these programs have been fruitful in some cases. For instance, from 2000 to 2019, the prevalence of P. falciparum malaria in Cambodia, Myanmar, Vietnam, Thailand and China was reduced by 97%, while countries previously certified malaria-free did not record any transmission or re-infection [17]. However, a global outlook showed that 20 more countries were added to the list of endemic countries within the period under review. Data on sub-Saharan Africa's improvement are also disjointed, but reports suggest a total of 215 million cases in 2019, up from 204 million in 2000 [20]. In contrast, malaria case incidence per 1000 population at risk fell from 365 in 2000 to 225 in 2019, which further reflects the complexity of demographic data in such a rapidly growing population. Notwithstanding the effect of the COVID-19 pandemic, the HBHI approach has kicked off in 10 of 11 malaria-endemic countries in sub-Saharan Africa. The impact, however, is yet to be felt region-wide, as the number of cases in the 11 HBHI countries in 2019 (156 million) was similar to that in 2018 (155 million). The WHO Global Malaria Programme (GMP) nevertheless foresees positive outcomes from this approach in the near term, following an aggressive commitment to adhere to the evidence-based recommendations developed by the WHO [15, 21].

More readily available options for malaria prevention and eradication take the form of investments in malaria programs and research, as contained in the Global Technical Strategy (GTS). The strategy aims to reduce the mortality rate and malaria case incidence by 40, 75 and 90% by 2020, 2025 and 2030, respectively; at the time of its launch in 2015, it did not take into consideration the potential disruption due to the COVID-19 pandemic. Several players, such as the Global Fund and the Bill and Melinda Gates Foundation, have invested immensely in malaria programs for the research and development of malaria drugs, vaccines, diagnostic tools and vector control products. Some of these investments have yielded what are today milestones in malaria treatment and prevention.

Current malaria control and treatment strategies

Chemoprevention and chemotherapy are the two major approaches known to reduce the burden of malaria in humans. Chemoprevention involves vector control (indoor residual spraying and insecticide-treated mosquito nets), which is recommended by the WHO to prevent malaria transmission. Indoor Residual Spraying (IRS) is a powerful vector control approach that involves spraying the inside of houses with insecticide once or twice a year. Sleeping under Insecticide-Treated Nets (ITNs) reduces malaria cases by providing both an insecticidal effect and a physical barrier to mosquitoes [22]. These chemopreventive measures are limited in application, coverage and effectiveness, hence the reliance on the chemotherapeutic approach. Since the discovery and development of quinine from the Peruvian Amazon Cinchona species during the nineteenth century, several antimalarial drugs have been developed for chemotherapeutic purposes [23]. The choice of drug depends on the prevalent Plasmodium species, demography, age, sex and the affected region. For example, travelers rely on chemoprophylaxis for the prevention of malaria. The WHO has recommended a minimum of three doses of intermittent sulfadoxine/pyrimethamine for pregnant women in endemic regions. For children under 5 years in endemic regions, monthly courses of amodiaquine in addition to sulfadoxine/pyrimethamine are recommended during the season of high transmission [15]. Currently, Artemisinin-based Combination Therapies (ACTs) remain the first-line treatment for malaria. ACTs and other drugs are in use across different WHO regions, and their effectiveness has brought malaria prevention and treatment to its current state; however, a lot more still needs to be done to close the gap.

The previous and ongoing antimalarial discovery

Natural-products inspired approach

Plant-based products have shown promising potential as antimalarial agents and are the source of the two most important antimalarial drugs currently in use. Quinine, the first antimalarial agent, was characterized in 1820 by French chemists. It was isolated from the bark of Peruvian Amazon Cinchona calisaya and C. succirubra (Rubiaceae) for the treatment of P. falciparum malaria [24]. Despite the continued use of quinine in chemotherapy, its effectiveness is hampered by its toxicity on prolonged use. Another plant-based compound still in use is artemisinin, an unusual endoperoxide sesquiterpene lactone from Artemisia annua of the Asteraceae family. Artemisinin was isolated by Chinese scientists in 1972 and has been in use against chloroquine-resistant P. falciparum [24]. Though an alternative to quinine, artemisinin is associated with problems such as recrudescence and high cost. The search for the ideal antimalarial drug has continued, and several other plant-derived compounds with antimalarial activity have been reviewed extensively [25,26,27,28,29,30,31,32,33,34].

Synthetic and semi-synthetic approach

Following the characterization of the Cinchona alkaloid quinine in 1820 for the treatment of complicated P. falciparum malaria, several 4-aminoquinolines were synthesized based on the quinine ring nucleus. One of these was chloroquine, which is cheap and less toxic and has been a component of the global malaria eradication campaign. However, chloroquine-resistant P. falciparum strains were discovered in Latin America and Southeast Asia and have spread to most of the WHO endemic regions. As with quinine, modification of artemisinin has led to the synthesis of several highly potent analogues for further development [35]. In one study, some primary sulfonamide compounds sourced from GlaxoSmithKline (GSK) selectively inhibited the in vitro growth of P. falciparum at submicromolar concentrations (IC50 0.16–0.89 µM). The inhibition, however, did not correlate with the known inhibition of the carbonic anhydrase enzyme by primary sulfonamides [36]. SAR was established for 1,2,3-triazole-naphthoquinone analogues, synthesized by a Cu(I)-catalyzed Huisgen 1,3-dipolar cycloaddition reaction, against chloroquine-sensitive P. falciparum F-32 Tanzania [37]. The study found that the nature of the substituents on the aromatic ring greatly influenced the antiprotozoal activity and further confirmed that the enzyme PfDHODH was the target of these compounds. Violacein, an indole pigment synthetically engineered from E. coli, was found to significantly affect the P. falciparum actin cytoskeleton [38]. Many traditional methods of antimalarial drug discovery are known, such as optimization of existing therapies, development of analogues of existing therapies, drug resistance reversers and compounds active against new targets. These approaches have been replaced by modern methods such as target-based and ligand-based approaches.

The computer-aided drug design approach

Traditionally, the High Throughput Screening (HTS) method is used in drug discovery; it involves extensive experimental testing of a library of compounds against selected targets and is a time-consuming and very expensive approach. The computational (virtual) approach has replaced HTS and involves in silico screening of large datasets for hit identification and subsequent design and optimization. This approach also enables the identification of compounds that are yet to be synthesized or made commercially available [34, 39].

(A) Ligand-based approach

The ligand-based approach in drug discovery is designed to analyze biological activity data retrospectively, and different ligand-based approaches have been developed and validated to understand the structural or chemical parameters involved in antimalarial activity. Previous studies have applied Quantitative Structure–Activity/Property Relationship (QSAR/QSPR) models to understand the contribution of different structural features to antimalarial activity and to predict the activities of yet-to-be synthesized molecules [40,41,42]. Specifically, the applicability of the ligand-based approach has been tested on several synthetic prodiginines, 3-carboxyl-4(1H)-quinolone analogues, side-chain modified 4-amino-7-chloroquinolines, artemisinin derivatives, 7-substituted-4-aminoquinoline derivatives, 4-anilinoquinolines, quinine-based active agents as well as several natural products [34]. The flowchart for building a typical 3D-QSAR model is shown in Fig. 1. Typically, the low-energy conformers of the dataset used to build a robust QSAR/QSPR model are aligned based on the biophore hypothesis. This is followed by modeling, internal validation of the model and prediction of untested compounds. The approach (Fig. 1) also provides contour maps derived from the model's regression coefficients, which guide the future design of new bioactive molecules or the modification of available molecules for better activity. This approach becomes more relevant where drug targets are unavailable or unknown.

Fig. 1 Workflow for ligand-based drug discovery (LMD: low mode dynamic; AM1: Austin model 1 Hamiltonian). Four major steps are important here: dataset pretreatment, alignment, modeling and visualization

(B) Structure-based approach

The structure-based approach involves a drastic reduction of the number of compounds in chemical space to a few hits with properties suitable for interacting with the target receptor. In this approach, the 3D structure of the target or of a homologous protein must be known; several such structures are freely accessible at the Protein Data Bank (PDB, pdb.org). The workflow for the structure-based approach is shown in Fig. 2.

Fig. 2 Workflow for a structure-based drug discovery approach (binding modes and mechanisms of molecules to specific amino acid residues of the receptor targets are obtained in this approach)

The first and critical step in the structure-based approach is the identification and validation of targets involved in the pathogenesis of malaria. Several targets have been identified in Plasmodium species for structure-based drug design, as shown in Fig. 2 [43,44,45,46,47,48]. These targets were routinely used to identify single-target therapies, in which one antimalarial drug is used throughout malaria treatment. This form of therapy has led to the emergence of resistance. Recently, a multi-targeting hybrid approach, in which one compound modulates several targets, has been developed [49,50,51,52,53]. It involves artemisinin-based hybrids, quinoline-based hybrids, paclitaxel-based hybrids and target-based approaches via HTS in hybrid design [34].

The gap in malaria prevention and treatment

Despite the plethora of approved antimalarial drugs and of natural product-inspired and synthetic compounds so far identified as antiplasmodial, malaria still ranks closely with tuberculosis and HIV/AIDS. Various malaria prevention approaches have suffered setbacks in recent years. Even though over 46% of Africans were protected from malaria by ITNs in 2019, ITN coverage has stalled since 2016. Moreover, IRS protection has consistently declined, from 5% in 2010 to 2% in 2019, across many WHO regions. The decline in protection was attributed to resistance developed by mosquito vectors to the pyrethroid insecticides used in IRS, which has forced countries to switch to more expensive insecticides.

Another widely documented gap is resistance to standard antimalarial drugs. The resistance of Plasmodium to chemotherapeutic agents was first observed in the 1950s and 1960s for chloroquine and sulfadoxine/pyrimethamine, thus reversing initial gains made in malaria control efforts [54, 55]. Similarly, partial resistance to artemisinin due to PfKelch13 mutations has been reported and is still under study. These gaps have been further widened by the emergence of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes COVID-19 and had spread to all malaria-endemic countries, resulting in over 30 million cases and 1 million deaths as of March 2021 [15].

Worse still, no effective vaccine is available at the moment to fight malaria in humans. Many collaborations among global health funding bodies to develop a vaccine have not yielded the desired result. Specifically, a vaccine against P. falciparum, RTS,S/AS01E, has been developed and has shown 40% effectiveness in preventing malaria infection. While efforts to develop a vaccine against Plasmodium species are ongoing, a new strategy is needed to tackle the ever-changing landscape of malaria infection and treatment. One such strategy is the use of ML tools.

The role of AI in malaria drug discovery

The need for new malaria drugs cannot be overemphasized, because P. falciparum parasites have developed resistance against many of the available drugs [54, 55]. Malaria drug discovery involves the following stages: (1) target selection and validation; (2) compound screening and lead optimization; (3) pre-clinical studies and (4) clinical trials. Carrying out these steps with the traditional approach to drug discovery is very expensive and time-consuming; therefore, drug discovery has recently focused on computational approaches [56], which use AI techniques. This section explores the AI techniques applied at each stage of malaria drug discovery; Table 1 summarizes them [56].

Table 1 Different AI tools used in various stages of malaria drug discovery

Another study presented AI and ML techniques used at different stages of drug discovery, together with the methods and the level of accuracy achieved [72]. This is summarized in Table 2.

Table 2 An overview of some studies that used AI for drug discovery [72]

The same study [72] also summarized the problems, at different stages of drug discovery, that novel AI approaches have been used to solve (Fig. 3).

Fig. 3 Outcome of different novel approaches to AI

ML approach

ML is an aspect of Artificial Intelligence (AI) that enables a system to acquire knowledge through experience. Here, experience means data, while knowledge means the ability to solve a problem, such as predicting the continuous IC50 values of chemical compounds against P. falciparum or predicting the bioactivity class of chemical compounds against P. falciparum. In this setting, the experience is a chemical dataset with various descriptors, which is used to predict anti-Plasmodium activity. ML tasks can be classified as prediction of continuous values (regression), prediction of classes (classification) and grouping of similar data items (clustering).

ML as a clustering tool

Clustering is an unsupervised ML technique that identifies groups of similar data in a dataset [73]. For a dataset of molecular compounds, clustering can be used to identify compounds that have similar chemical properties. Clustering a dataset into groups of similar items can take any of the following forms:

(a) Exclusive clustering: each data item belongs to only one group.

(b) Overlapping clustering: a data item can belong to more than one group.

(c) Probabilistic clustering: each data item belongs to each group with a known probability.

(d) Hierarchical clustering: the dataset is split into groups of similar data in a hierarchical manner. For example, the dataset can be split into two main groups, male and female. Each main group is refined into subgroups, such as age groups, and each age group can be split into smaller subgroups, etc.

One classic ML clustering algorithm based on Euclidean distance is the K-means clustering algorithm. Figure 4 illustrates exclusive clustering, while Table 3 shows probabilistic clustering.
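To make the Euclidean-distance idea concrete, the following is a minimal sketch of K-means in plain Python. The two-descriptor toy points, the function names and the choice of initial centroids are illustrative assumptions and do not come from any of the reviewed studies.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centroids, max_iter=100):
    """Exclusive clustering: every point ends up in exactly one cluster."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its current members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical compounds described by two made-up, already scaled descriptors.
toy_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
              (0.9, 0.8), (0.8, 0.95), (0.85, 0.9)]
labels, centroids = k_means(toy_points, centroids=[toy_points[0], toy_points[3]])
print(labels)
```

In practice the descriptors would need to be scaled to comparable ranges first, since Euclidean distance is dominated by the descriptor with the largest numeric spread.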

Fig. 4 Example of exclusive clustering

Table 3 Example of probabilistic clustering

The importance of clustering compounds by structural or property similarity cannot be overemphasized. It provides a powerful approach to correlating compound features with bioactivity [7]. It can also be used for diversity analysis and for identifying compound redundancies and other biases in compound libraries [7]. Clustering has been used as an ML tool in analyzing molecular compound datasets for IC50 values against P. falciparum. The authors of [3] extracted the three most common targets from MacrolactoneDB, namely P. falciparum (malaria), Hepatitis C and T-cells, conducted cheminformatics analysis on them and developed an ML workflow. Unsupervised hierarchical clustering was conducted using Euclidean distance. The purpose of the clustering in [3] was to identify compounds that share similar chemical properties but have different structural fragments, which resulted in different IC50 values. Furthermore, clustering can help indicate the relevance of descriptors, based on whether they cluster compounds with similar activities. The clustering in [3] showed that the P. falciparum dataset has two clusters, which suggests two groups of compounds; each group shares similar chemical properties but different structural fragments, which contributed to different IC50 values. The link between these clustering results and the different structural fragments is the concept of the Activity Cliff. This concept is very useful as a curation tool when preparing datasets of chemical compounds that are active against P. falciparum. It can be used to separate compounds with similar chemical properties but different IC50 values against P. falciparum, thereby leading to a high-quality chemical dataset. This was demonstrated in the study using a Tanimoto similarity of 0.87 and an IC50 difference of 11.99 nM. This is called an Activity Cliff because there is a great disparity between the IC50 values of the compounds despite their similar structures [74]. The same similarity metric (Tanimoto) was used to measure the similarity between training and test data in order to apply a semi-supervised ML framework [75]. Another study [76] used Tanimoto coefficients above different threshold values, for different fingerprint similarity searching methods, to search for compounds whose IC50 against P. falciparum falls within specific ranges. To avoid selection bias, clustering has also been used to distribute chemical features evenly between a training set and a test set [77, 78]. This was done by dividing the molecules into clusters of ten molecules using hierarchical clustering. Clustering has likewise been used to select the most appropriate base model for analyzing a given chemical dataset [79]. To visualize the result of the clustering, the authors generated a chemical network of the compounds using Gephi. Each node of the chemical network was a macrolactone ligand.

Activity Cliff

Activity Cliff is related to clustering compounds with similar structural properties. It has been defined as a pair of compounds with similar structures but different potency (activity) against a known target [40]. Activity Cliffs play an important role in medicinal chemistry and cheminformatics because, in structure–activity relationship analysis and optimization, the small chemical modifications behind large potency changes can be deduced from cliffs of high magnitude [40]. Furthermore, as part of the curation process, activity cliffs have been used to prepare a chemical dataset by removing pairs of compounds with high structural similarity but unexpectedly high activity difference [79]. This ensures that pairs of compounds with high structural similarity have similar activity on the target when the dataset is used for QSAR. Based on this definition, four key components of an activity cliff can be identified: only a pair of compounds is considered, both compounds are active against the same known target, a structural similarity criterion must be specified and a potency difference criterion must be established [40]. The Tanimoto coefficient is the most commonly used measure of structural similarity between two compounds, while IC50 or Ki can be used as the potency measure.
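The two criteria in this definition (a structural similarity criterion and a potency difference criterion) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fingerprints are represented as sets of "on" bit positions, and the compound names, fingerprints, IC50 values and both thresholds are hypothetical.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def find_activity_cliffs(compounds, sim_threshold, potency_gap):
    """Return pairs that are structurally similar yet differ strongly in potency.

    compounds maps a name to (fingerprint set, IC50 in nM) for one common target.
    """
    cliffs = []
    for (name_a, (fp_a, ic_a)), (name_b, (fp_b, ic_b)) in combinations(compounds.items(), 2):
        similar = tanimoto(fp_a, fp_b) >= sim_threshold      # similarity criterion
        different = abs(ic_a - ic_b) >= potency_gap          # potency difference criterion
        if similar and different:
            cliffs.append((name_a, name_b))
    return cliffs

# Hypothetical fingerprints and IC50 values, for illustration only.
toy = {
    "cpd1": ({1, 2, 3, 4, 5, 6, 7, 8}, 5.0),
    "cpd2": ({1, 2, 3, 4, 5, 6, 7, 9}, 250.0),
    "cpd3": ({10, 11, 12}, 6.0),
}
print(find_activity_cliffs(toy, sim_threshold=0.75, potency_gap=100.0))
```

For dataset curation as described above, the pairs returned by such a function would be the candidates for removal; the appropriate thresholds depend on the fingerprint type and the assay, and are a choice the analyst has to justify.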

Clustering algorithm for chemical compound datasets

Clustering has been used in the pharmaceutical industry to create training and test datasets [80]. The most commonly used algorithm for clustering the molecules of a chemical dataset is the Jarvis–Patrick (J-P) algorithm. However, it has known problems: it produces clusters that are either too large, in terms of the number of molecules, and heterogeneous (low Tanimoto similarity), or too small but homogeneous (high Tanimoto similarity) [80]. To address these problems, the authors of [80] developed another clustering algorithm that creates homogeneous clusters (high Tanimoto similarity) while avoiding clusters with too few or too many molecules. The algorithm follows three steps: (a) generation of Daylight fingerprints (ASCII), (b) identification of potential cluster centroids and (c) mutual exclusion clustering.

The first step generates a fingerprint for each molecule in ASCII format, as a string of 0s and 1s, using the Daylight software. The second step identifies the central molecule (centroid) of each cluster: using the specified Tanimoto similarity value, it counts the neighbors of each molecule and sorts the molecules in descending order of neighbor count, so that the molecule with the most neighbors is at the top of the list and becomes the first centroid. Finally, the third step iteratively determines the members of each cluster by computing the pairwise Tanimoto similarity between the centroid molecule and the other molecules. If the pairwise Tanimoto similarity is greater than or equal to the value used for the clustering, the molecule is taken as a member of the cluster and removed from the list. The next molecule in the list is then taken as a centroid and the iteration continues. Any molecule that is still in the list at the end of the process is regarded as a singleton. The result of the clustering algorithm is illustrated in Fig. 5.
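Steps (b) and (c) can be sketched as follows. This is one reading of the description above and not the implementation from [80]: fingerprints are assumed to be sets of "on" bit positions rather than Daylight ASCII strings, a centroid that attracts no members is treated as a singleton, and the molecule names, fingerprints and threshold are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def mutual_exclusion_clustering(fingerprints, threshold):
    """Cluster molecules around centroids ranked by their number of neighbours."""
    names = list(fingerprints)
    # Step (b): count, for each molecule, the neighbours at or above the threshold.
    neighbour_count = {
        n: sum(
            1 for m in names
            if m != n and tanimoto(fingerprints[n], fingerprints[m]) >= threshold
        )
        for n in names
    }
    # Molecules with the most neighbours come first and become centroids first.
    remaining = sorted(names, key=lambda n: neighbour_count[n], reverse=True)

    # Step (c): each centroid claims its neighbours, which leave the list for good.
    clusters, singletons = {}, []
    while remaining:
        centroid = remaining.pop(0)
        members = [
            m for m in remaining
            if tanimoto(fingerprints[centroid], fingerprints[m]) >= threshold
        ]
        if members:
            clusters[centroid] = members
            remaining = [m for m in remaining if m not in members]
        else:
            singletons.append(centroid)  # no molecule is similar enough to join it
    return clusters, singletons

# Hypothetical fingerprints, for illustration only.
toy_fps = {
    "a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {1, 2, 3, 6},
    "d": {7, 8, 9, 10}, "e": {7, 8, 9, 11},
    "f": {20, 21},
}
clusters, singletons = mutual_exclusion_clustering(toy_fps, threshold=0.5)
print(clusters, singletons)
```

Because a molecule is removed from the list as soon as it joins a cluster, every molecule ends up in at most one cluster, which is what makes the clustering mutually exclusive.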

Fig. 5 Result of the clustering algorithm

In Fig. 5, the members of cluster A have been grouped based on their pairwise Tanimoto values with the centroid molecule, colored red. Similarly, the members of cluster B have been grouped based on their pairwise Tanimoto values with the centroid molecule, colored green. The molecule colored yellow is the singleton [73].

Most clustering algorithms have been implemented as software tools. One such tool is ChemMine Tools, an online portal that offers several cheminformatics functions, such as search, visualization and clustering [7]. ChemMine Tools provides five major functionalities: data visualization, structure comparison, similarity searching, compound clustering and prediction of chemical properties [7]. The similarity toolbox of ChemMine implements an algorithm that uses atom pairs as the structural descriptor and the widely used Tanimoto coefficient to compute similarity among compounds. ChemMine also allows the use of other similarity coefficients, such as Tversky or Dice [7]. Furthermore, the clustering toolkit of ChemMine implements three clustering algorithms: hierarchical clustering, Multi-Dimensional Scaling (MDS) and binning clustering [7]. Clustering by structural similarity requires first generating the atom pair descriptors (features) for each compound; these are then used to calculate the similarity matrix with the Tanimoto coefficient. While hierarchical clustering organizes the compounds by similarity in a tree structure, MDS presents the similarity information in a scatter plot. Neither method assigns the compounds to discrete similarity groups directly; the assignment is done later in the clustering process using post-processing approaches such as tree cutting [7]. The binning method, on the other hand, produces the clustering groups from a user-defined similarity cut-off: the user chooses a cut-off, and compounds whose similarity is greater than or equal to the chosen value are assigned to the same group [7].
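The three coefficients named above can be compared with a short Python sketch. It uses the standard set-based definitions and is not ChemMine's own code; the atom-pair labels are made-up strings standing in for real atom pair descriptors.

```python
def tanimoto(a, b):
    """Shared features divided by the features present in either compound."""
    c = len(a & b)
    denom = len(a) + len(b) - c
    return c / denom if denom else 0.0

def dice(a, b):
    """Twice the shared features divided by the sum of both feature counts."""
    denom = len(a) + len(b)
    return 2 * len(a & b) / denom if denom else 0.0

def tversky(a, b, alpha, beta):
    """Asymmetric coefficient: alpha and beta weight the unshared features of a and b."""
    c = len(a & b)
    denom = c + alpha * len(a - b) + beta * len(b - a)
    return c / denom if denom else 0.0

# Hypothetical atom-pair descriptors for two compounds (labels are illustrative only).
pairs_x = {"C-C:1", "C-N:2", "C-O:3", "N-O:4"}
pairs_y = {"C-C:1", "C-N:2", "C-O:3", "C-S:2"}

print(tanimoto(pairs_x, pairs_y))
print(dice(pairs_x, pairs_y))
print(tversky(pairs_x, pairs_y, alpha=1.0, beta=1.0))  # reduces to Tanimoto
print(tversky(pairs_x, pairs_y, alpha=0.5, beta=0.5))  # reduces to Dice
```

Since Dice is never smaller than Tanimoto for the same pair of compounds, a fixed cut-off in binning clustering groups more compounds together under Dice than under Tanimoto, so the cut-off should be chosen with the coefficient in mind.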

In addition to ChemMine, there are other software tools for analyzing chemical compound datasets that offer additional clustering functionality; one such tool is ChemmineR [81].

ML as a classification tool

A review of the relevant literature identified studies that applied ML approaches to predict activity against P. falciparum. In this section, the relevant classification models are reviewed: six reports identified SVM as the best classification tool and four identified Random Forest as the best modeling tool, while eleven other modeling tools were also identified, as shown in Fig. 6B.

Fig. 6

Classification of objects into two different categories (A) and distribution of the best performing ML algorithms based on the relevant articles reviewed (B). In (A), datasets are classified based on similarity (blue or red color, circular or triangular shape) or dissimilarity (blue and red colors or circular and triangular shapes). In (B), Deep Learning (DL), Boosted Trees Regression (BTR), J48 classifier (JC), Discriminant Functions (DF), XGBoost (XGB), Graph Convolutional Neural Networks (GCNN), Multilinear Regression (MLR), General Regression Neural Network (GRNN), Multivariate Analysis (MVA), C5.0 and Artificial Neural Networks (ANN) represent (X), Random Forest (RF) represents (Y), and Support Vector Machine (SVM) represents (Z)

RF algorithm

A study on varied drug-decorated nanoparticles organic compound/drug complexes used eight ML classifiers to predict activity against P. falciparum [8]. The dataset comprised 107 input features and 249,992 compounds, and the best model was RF (27 selected features), with a mean area under the Receiver Operating Characteristic curve (AUROC) of 0.9921 ± 0.000244 (tenfold cross-validation), which is statistically significant. Janairo et al. introduced a predictive ML model with 20 chemical descriptors to establish a relationship between the descriptors and the mosquito repellent activity of 33 natural compounds using four classifiers. The optimized BTR model, the best performer, demonstrated better predictive ability (r2 train = 0.93, r2 test = 0.66, r2 overall = 0.87) than the other ML methods applied [82].
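For readers unfamiliar with how a cross-validated AUROC is reported as a mean with a spread, the following sketch computes the AUROC from its rank interpretation and then summarizes per-fold values. It is only an illustration: the labels, scores and per-fold values are hypothetical and are not taken from the studies cited above.

```python
from statistics import mean, stdev


def auroc(labels, scores):
    """AUROC as the probability that a random active is scored above a
    random inactive (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize_folds(fold_aurocs):
    """Mean and sample standard deviation of per-fold AUROC values."""
    return mean(fold_aurocs), stdev(fold_aurocs)


# Hypothetical predictions for six compounds (1 = active, 0 = inactive).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
example_auroc = auroc(labels, scores)

# Hypothetical AUROC values from three cross-validation folds.
fold_mean, fold_sd = summarize_folds([0.90, 0.92, 0.94])
```

In a tenfold cross-validation, the same summary would simply be taken over ten per-fold values.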

The RF algorithm showed a lower overall accuracy of 0.75 in a QSAR study of 323,201 compounds with 179 descriptors that sought to identify the biological activity of new antimalarials against the apicoplast in P. falciparum [83]. The regression analysis showed an AUC of 70%, a specificity of 80% and a sensitivity of 40–50%. Egieyeh et al. [84] applied four ML algorithms (Optimization of SVMs, Naïve Bayesian, Voted Perceptron, Sequence Minimization and RF) to the QSAR of 1155 natural products with in vitro antiplasmodial activity using 76 descriptors. With an accuracy of 82.8% and an AUC of 0.91, this study appeared to be more predictive than the previous one [83], a difference that could be attributed to the excessive number of descriptors or the poor correlation in the former [83]. A study developed and evaluated a 97 QSAR model of 16 datasets to generate a predicted profile in bioactivity and cytotoxicity using different approaches (e.g., conformal prediction framework) to improve the prediction accuracy of models [85]. The result was evaluated by modeling the dataset with and without the predicted continuous bioactivity profiles; the performance of the final models improved when these profiles were added.
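Accuracy, sensitivity and specificity, as quoted above, all derive from the same confusion matrix. The sketch below shows the arithmetic on hypothetical labels (not data from any cited study) and illustrates how a classifier can reach a moderate accuracy while still missing half of the active compounds.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for labels coded 1 (active) and 0 (inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate) and specificity
    (true negative rate) from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical outcome for ten compounds: four actives, six inactives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Here two of the four actives are missed (sensitivity 0.5) even though seven of the ten predictions are correct (accuracy 0.7).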

SVM algorithm

The SVM is another ML algorithm, which finds a separating hyperplane and is used in both classification and regression analyses [86,87,88]. This algorithm has been applied in regression analysis for the prediction of biological activity against P. falciparum. In one study, both linear and nonlinear SVM models were built to classify 999 compounds (inhibitors and non-inhibitors) for anti-proliferative activity against P. falciparum using 383 descriptors [12]. Statistical validation showed an accuracy of 83% and an AUC of 0.88. The predictive power of the optimized model suggests that it may be effective in selecting potential hits when screening large libraries. A dataset of ~4750 compounds with activity against P. falciparum was subjected to four ML algorithms (SVM, RF, kNN and XGB) with 98 descriptors [89]. Both SVM and XGB performed better, with ~85% on the independent test set. This finding further supported the conclusion of [12] that the built models are efficient and may be useful for facilitating the discovery of antimalarial agents. With a slightly higher SVM prediction accuracy (R2 training 8.95 and R2 test 8.73), a good 2D-QSAR model was reported in a study involving 4750 compounds to identify antimalarial activities against P. falciparum using 15 descriptors [90]. The study also showed GRNN prediction accuracies of 99.7% for the training set (3887 compounds) and 88.9% for the test set (863 compounds). Similarly, a study evaluated 116,987 antimalarial compounds against apicoplast formation using 173 descriptors [91]. The R caret package was used to build predictive models with different algorithms, including Generalized Linear Model (GLM), kNN, SVM, RF and the C5.0 decision tree. Model validation showed that C5.0, SVM and RBF outperformed the others. The modeling of 277 P. falciparum proliferation inhibitors and non-inhibitors with SVM using various descriptors showed 87% overall accuracy and an AUC of 0.73 [92].
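The difference between a linear and a nonlinear SVM lies in the kernel used in the decision function, which has the general form f(x) = Σ c_i K(s_i, x) + b, where the s_i are support vectors and c_i their signed coefficients. The sketch below evaluates this function for a linear and a radial basis function (RBF) kernel. The support vectors, coefficients, bias and gamma value are hypothetical and hand-picked for illustration; they are not the result of training on any dataset.

```python
import math


def linear_kernel(x, z):
    """Dot product: yields a linear decision boundary."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: yields a nonlinear boundary."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, coeffs, bias, kernel):
    """f(x) = sum_i c_i * K(s_i, x) + b, where c_i = alpha_i * y_i."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, coeffs)) + bias


def classify(x, support_vectors, coeffs, bias, kernel):
    """Return +1 (e.g. inhibitor) or -1 (non-inhibitor) from the sign of f(x)."""
    value = decision_value(x, support_vectors, coeffs, bias, kernel)
    return 1 if value >= 0 else -1


# Hypothetical, untrained model: one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
coeffs = [-1.0, 1.0]
bias = 0.0

near_positive = classify((1.9, 2.1), support_vectors, coeffs, bias, rbf_kernel)
near_negative = classify((0.1, -0.2), support_vectors, coeffs, bias, rbf_kernel)
```

In practice the support vectors and coefficients come from solving the SVM optimization problem on descriptor data, typically with an established library rather than hand-written code.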

QSAR ML algorithms

A structural descriptor-based QSAR model of anti-Plasmodium liver-stage bioactivity, together with the prediction of physicochemical parameters influencing intestinal absorption, has been reported for 127 compounds [93]. Seventeen drugs that were predicted to be active or inactive were selected for testing against the hepatic stage of P. yoelii in vitro. Antiretroviral, antifungal and cardiotonic drugs were found to be highly active (nanomolar 50% inhibitory concentration values), and two ionophores completely inhibited parasite development. The most active compounds against the hepatic stages of P. yoelii yoelii and P. falciparum were monensin and nigericin, with IC50 of 10.3 nM, and the analysis was used to categorize the compounds into highly active, active and inactive groups according to their 50% inhibitory concentrations (IC50). A more comprehensive MLR 2D-QSAR model to predict the anti-P. falciparum activity of two datasets of organic compounds, with R2 values of 0.84 and 0.89, respectively, has been demonstrated [94]. In addition to MLR, Santos et al. [95] used 230 descriptors with PLS and PCR analysis to describe the QSAR of artemisinin and 20 derivatives, and further predicted, with high statistical significance, the antimalarial activity of 30 new artemisinin compounds of unknown activity. A larger dataset of 72 compounds with fewer descriptors (39) was used to build a QSAR model against the 3D7 P. falciparum strain, which identified 31 potential antimalarial compounds [6]. Interestingly, another study demonstrated a 2D-QSAR model of 3133 compounds using 929 descriptors that showed an accuracy of only 14.2% [96]. A similar study applied Artificial Neural Networks with the Levenberg–Marquardt algorithm (a non-linear approach) to the anti-malarial activity of a set of 33 imidazolopiperazine compounds against the 3D7 and W2 strains [97]. The results showed that the suggested model was more acceptable for predicting 3D7 activity than W2 activity, with R2train = 0.947, R2val = 0.959 and R2test = 0.920. The R2, MSE and leverage values showed that the ANN method predicts the anti-malarial activity of imidazolopiperazine compounds well and can be used as a virtual tool to design more efficient compounds with activity against malaria (3D7 and W2 strains). An integrated application of ML algorithms, CoMFA analyses and molecular docking on a set of 228 known triclosan and rhodanine inhibitors of P. falciparum enoyl acyl carrier protein reductase (PfENR) yielded training-set and evaluation-set accuracies of 94.18 and 57.14% for IB1 and 92.80 and 68.57% for Kstar, respectively [77]. Neves et al. [78] adopted deep learning to build binary and continuous QSAR models with 2D RDKit descriptors, based on large datasets, for predicting the antiplasmodial activity and cytotoxicity of 413,855 untested compounds. The developed computational models were used to prioritize novel, active and nontoxic compounds from virtual chemical libraries for experimental evaluation. Similarly, an ML-based QSAR model was developed to predict which molecules will block the malaria parasite's ion pump, PfATP4 [98]. The model was then employed to screen and classify DrugBank database molecules and compounds from a proprietary marine molecule library.
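The R2 and MSE statistics used to compare the QSAR models above can be computed as in the following sketch. The observed and predicted activity values are hypothetical. By construction, R2 equals 1 for perfect predictions and cannot exceed 1; it equals 0 for a model that always predicts the mean of the observations.

```python
def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted activities for a small test set.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.1, 5.9, 7.2, 7.8]

r2_test = r_squared(observed, predicted)
mse_test = mse(observed, predicted)
```

Reporting the statistic separately for the training, validation and test sets, as several of the cited studies do, helps reveal overfitting: a large gap between training and test R2 suggests that the model does not generalize well.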

Other ML algorithms

A deep learning-based algorithm (DeepMalaria) for predicting the anti-P. falciparum activity of 13,446 compounds with 23 descriptors was demonstrated using their SMILES [99]. The algorithm predicted 72.3% of the active compounds in the validation dataset and 87.8% of those in the test dataset, with acceptable accuracy in an imbalanced setting, showing significant predictive potential to improve drug design and development. A systematic review has also been reported on the green synthesis of metal nanoparticles as a potential source of new antiplasmodial drugs [100]. Seven electronic databases were searched and 17 papers were included in the review. A very high proportion of the studies (82.4%) used plant leaves to produce nanoparticles (NPs), while three studies used microorganisms, including bacteria and fungi.

ML as a regression tool

There have been many reports on the use of regression analysis as an ML tool with different ML algorithms such as deep learning, Random Forest (RF), Boosted Trees Regression (BTR), J48 classifier, DF, SVM, XGBoost, GCNN, Multilinear Regression (MLR), GRNN, C5.0 and ANN, among others. In this regard, the regression tool is a predictive computational model that enables one to understand the correlation between chemical properties (descriptors) and their activities, i.e., to computationally screen large molecular datasets, thereby offering a possibility to improve the hit rate and reduce the overall costs of drug discovery. This has been applied extensively in the discovery of anti-P. falciparum drugs. Careful analysis of the results reported by different authors shows that efficient computational predictive models help to screen large datasets in silico and could potentially be used to prioritize molecules for high-throughput screens.

In a multi-parametric QSAR study to predict IC50 and Log P for 5-N-acetyl-β-D-neuraminic acid, with 110 training-set and 50 test-set compounds structurally related to 5-N-acetyl-β-D-neuraminic acid, PolyAnalyst was used to develop the linear model using a stepwise linear regression algorithm [101]. The predicted IC50 values gave good statistical measures, with a correlation coefficient, standard deviation and standard error of 0.8545, 0.2932 and 0.3815, respectively. Importantly, the model showed that a strong correlation exists between Log P and the IC50 of drug compounds. A dataset of 34 compounds and 8 descriptors was subjected to MLR analysis to construct a QSAR. This method produced higher R2 (0.9714–0.9909) and RMSEP of 0.0938 and 0.1819 compared with the method of Pushpa and co-workers [101]. A regression tool was also applied to assess the transmission-blocking potential of 44 anti-malarial compounds in the mosquito feeding assay using the P. falciparum male gamete inhibition assay [102]. A Root Mean Square Error (RMSE) of 22.51% was obtained from the measured relationship between exflagellation inhibition (EI) and oocyst reduction [102]. The regression model provided pIC50 predictions in the standard membrane feeding assay (SMFA) with high accuracy, and IC50 values for 11 compounds obtained in the exflagellation inhibition assay were correlated with IC50 values in the SMFA. However, it was stated that the small dataset (n = 44) used to build the model may render the result unreliable.
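The quantities used in this paragraph (pIC50, a fitted linear relationship, a correlation coefficient and an RMSE) can be illustrated with a short sketch. It fits a single-descriptor least-squares line rather than the stepwise or multiple regression used in the cited studies, and the Log P and IC50 values are hypothetical.

```python
import math


def pic50_from_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); 1 nM = 1e-9 mol/L."""
    return 9.0 - math.log10(ic50_nm)


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(x, y):
    """Pearson correlation coefficient between two variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def rmse(observed, predicted):
    """Root mean square error of the predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical Log P and IC50 (nM) values for four compounds.
log_p = [1.0, 2.0, 3.0, 4.0]
ic50_nm = [1000.0, 500.0, 100.0, 50.0]
pic50 = [pic50_from_nm(v) for v in ic50_nm]

slope, intercept = fit_line(log_p, pic50)
fitted = [slope * v + intercept for v in log_p]
correlation = pearson_r(log_p, pic50)
fit_rmse = rmse(pic50, fitted)
```

Working on the pIC50 scale rather than raw IC50 is a common choice because activity values usually span several orders of magnitude.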

ML has also been successfully applied in epidemiological studies of malaria [103, 104]. To predict malaria outbreaks using six observed variables, a dataset of thirty-eight compounds collected from malaria samples of Maharashtra State with eight descriptors was used [105]. Logistic regression, random decision trees and Gaussian processes were used to determine the performance of the model. The regression model, as well as the decision tree and Gaussian models, gave 100% accuracy in predicting malaria outbreaks [106]. A combination of 8 ML algorithms (KNN, SVM, SVM linear, linear regression, linear discriminant analysis, DT and RF classifiers) has been used to predict the effect of compound/drug reactions that have antimalarial activity against Plasmodium [8]. The findings showed that Random Forest classifiers gave more accurate results than the other learning algorithms.

Importantly, the top six ML algorithms (simple linear regression model, lasso, logistic regression, Support Vector Machines, multivariate regression algorithm and multiple regression algorithm) are commonly used in data mining, and their applications in industry are well known.

Expert opinion and prospects

The available epidemiological data show that malaria is, without doubt, a disease of today and the future despite huge investment in vaccine development and drug discovery [107]. Available drugs have been overwhelmed by Plasmodium resistance and poor pharmacokinetic-related limitations, which calls for an urgent exploration of more approaches. Natural product-derived compounds, synthetic compounds and several attempts to modify existing drugs have not yielded the desired products. Despite huge deposits of potential antimalarial compounds in various databases, none has been translated from such virtual spaces to the bedside; perhaps the goldmine strategies are yet to be exploited. The goal of pharmaceutical and drug discovery scientists has always been to discover and develop new drugs that will ultimately benefit the patient within the shortest possible time and at an affordable cost. ML is a developing trend in the drug discovery industry. It is expected to revolutionize the drug discovery process by introducing efficiencies that will lead to the discovery of new drugs in a shorter time and at a lower cost.

ML in drug discovery has come to stay, and its application in the discovery of anti-Plasmodium drugs is emerging. The question now is whether a game-changing ML approach has been explored, exploited or adopted. That is the crux of this systematic review. Several known ML algorithms have been applied in anti-Plasmodium drug discovery, with acceptable measures of statistical significance. With ML, the biodiversity that has been under threat from several drug discovery programs can be conserved or handled with a more precise approach. There is a need to extend ML technology to early-career DDD scientists so that the tools used by ML specialists soon become the norm in laboratories involved in drug discovery and development. ML and its tools could also find use in the downstream processing of pharmaceuticals, where current good manufacturing practices are expected to be strictly followed to ensure the consistent production of high-quality medicines that meet regulators’ specifications.

Several ML tools were reviewed in this paper. Careful analysis of the literature indicated that Support Vector Machine (SVM) was the most favored tool, followed by Random Forest. SVM has been widely applied in biological and other sciences with high accuracy. However, the other machine learning tools identified in this study have been used sparingly and could serve as a good starting point for the discovery of game-changing antimalarial drugs. It is thus expected that the application of this ML tool or its modification in the discovery of antimalarial and other drugs will progress rapidly in the coming years, considering the urgent need for new anti-infectives to meet the healthcare needs of countries in the endemic regions of the world.

Conclusion

Malaria can be eradicated in sub-Saharan Africa by the combination of chemotherapy and chemoprevention. The emergence of resistance has continued to hamper chemotherapeutic approaches. However, emerging drug discovery methods such as ML have continued to show potential for new molecules capable of circumventing many known challenges. Until total eradication is achieved, the search for vaccines and cures will continue to receive attention. Our team is currently working on the application of various ML algorithms to the discovery of potent, safe, affordable and deliverable molecules against P. falciparum.