Introduction

Cancer is one of the leading causes of death around the globe. According to the global cancer statistics for 2020, there were 19.3 million new cancer cases and 10 million cancer-related deaths. The worldwide cancer burden is estimated to rise by 47% to 28.4 million cases by 2040 [1]. Although a wide range of treatments such as chemotherapy, radiation therapy, and immunotherapy are currently available to combat cancer, there is still a need to find more diverse sources of new chemical entities (NCEs). While chemotherapy is currently the most reliable form of cancer treatment, it is accompanied by adverse effects such as toxicity to non-targeted tissues, relapse, and drug resistance. Drugs designed from natural products (NPs) aim to counter several disadvantages of synthetic compounds and traditional chemotherapy techniques [2]. Molecular scaffolds of NPs show high chemical diversity, making them suitable lead-like compounds for drug development [3]. Moreover, screening of NPs can add novel molecules to current libraries, which can then be evaluated against the pharmacokinetic parameters used in drug discovery [4,5,6,7,8]. Since most NPs are naturally bioactive, they have the potential to be used more effectively in the process of drug development. Of the known anti-cancer drugs, 50% are derived from NPs [9], among which plant-based NPs have been historically used [10, 11]. These compounds are either isolated phytomolecules or semi-synthetic molecules of natural origin [12, 13]. In addition, different classes of plant-derived NPs have been tested against a number of cancer cell lines, and their anti-cancer activity has been reported in the literature [14]. These molecules mainly belong to the terpenoid, flavonoid, and alkaloid categories (Fig. 1).

Fig. 1
Different classes of phytomolecules with known anti-cancer evidence

The boom in the amount of biological information generated has led to the age of biological big data [15]. Accordingly, developing computational resources to analyse the molecular features and chemical behaviour of NPs from an in silico perspective has gained importance. Various chemoinformatics and bioinformatics approaches, such as structure- and ligand-based techniques, pharmacophore modelling [16], virtual screening (VS) [17, 18], quantitative structure–activity relationship (QSAR) modelling [19], and network pharmacology, are currently being used for the analysis and interpretation of these data. This has in turn led to the compilation of multiple libraries of NPs with potential lead-like properties [7, 14]. However, despite the availability of substantial amounts of chemical and biological data and of traditional plant-based anti-cancer remedies, there remains a gap in the domain of plant-based anti-cancer drug development. To usher in a new generation of NP-based drug discovery, machine learning (ML), deep learning (DL), and artificial intelligence (AI) offer a range of functionalities that may play an important role in the drug discovery pipeline [20,21,22,23]. Nevertheless, traditional drug discovery techniques are still more widespread than the newer generation of techniques (Fig. 2).

Fig. 2
Current scenario of anti-cancer drug discovery

In the past 5 years, a number of reviews in this domain have been published, but they tend to focus either solely on the anti-cancer efficacy and mechanism of action of different phytomolecules [9, 24,25,26] or on the current practices of drug discovery and development [27,28,29,30]. The current review therefore focuses on studies that use plant-based molecules for anti-cancer drug discovery with the help of various in silico tools and techniques, so that a better understanding can be gained of the current advancements and available approaches. Given the current trend towards data-driven drug discovery and development, we have added a section that specifically addresses the use of ML in NP-based drug discovery. We highlight recent advancements in the field of cancer drug discovery, emphasising the use of plant-based compounds that have proven anti-cancer properties as well as novel lead compounds that have shown positive results in bioactivity studies. Finally, we conclude with an overview of the tools and databases currently in use that reflect the vast diversity of plant-derived NPs, NP-based applications, and their subsequent uses in NP-based drug discovery.

Computational approaches used for natural product drug discovery

In the field of computer-aided drug discovery (CADD), structure- and ligand-based approaches are both in use, and the choice between them depends on the availability of a crystallised 3-dimensional (3D) structure of the receptor molecule in the Protein Data Bank (PDB) [31]. For ease of understanding, we have categorised the different computational techniques under these two sub-categories and discuss in detail how they have been used to identify plant-derived NPs against different cancer targets.

Structure and ligand-based drug discovery

In structure-based drug design, the active site of the macromolecule structure, the presence of specific amino acids in the binding pockets, and the strength of interaction of the reacting species are all considered during lead identification. Structure-based approaches like molecular docking and VS have facilitated the fast and cost-effective analysis of large compound databases to identify potential hit molecules [17, 18, 32]. To select a hit compound, many molecules are screened to identify those that have the desired activity against a target and a specific set of chemical and structural features [33]. Ligand-based drug design (LBDD), on the other hand, is used in the absence of relevant 3D structural information about the receptor molecule and depends on information about small-molecule ligands that are biologically active in a specific disease or against a drug target. QSAR and pharmacophore modelling are the most prominent techniques used in LBDD to identify and design novel inhibitors.

Virtual screening (VS) and molecular dynamics (MD)

VS is a highly effective tool for discovering new active compounds in drug discovery [34, 35]. MD, on the other hand, helps establish the strength and stability of the interactions in the protein–ligand complex [36, 37]. In drug development, VS is used to screen libraries for compounds that show high binding affinities. This time-efficient technique not only helps in dealing with large datasets but also reduces late-stage drug development failures, as it narrows down the number of compounds to be evaluated in biological assays. Unlike the static conformations used in classical models of drug development studies, the biomolecules in the human body are constantly in motion. Thus, it is necessary to understand the changing molecular structures of receptors in order to design better approaches to drug development. MD predicts the molecular and structural changes of biomolecules under the influence of inter- and intramolecular forces and is therefore essential for drug development studies. In the area of NP drug discovery, VS has been extensively used to discover molecules against different cancer targets. Examples of published studies that use VS and MD to identify plant-based compounds as leads are listed in Table 1, a few of which are discussed in detail below.

Table 1 Studies listing different drug discovery techniques used for the identification of NP-based anti-cancer molecules

Muhseen et al. [38] used VS, absorption, distribution, metabolism, excretion, and toxicity (ADMET) filtering, and MD to discover potential inhibitors of Mouse double minute 2 homolog (MDM2) among terpenoid compounds. They utilised (a) pharmacophore model-based VS and (b) Tanimoto coefficient-based screening using the co-crystal ligand as the reference to identify compounds with similar activity. Three terpenes, 3-trans-p-coumaroyl maslinic acid, silvestrol, and betulonic acid, were thereafter identified as potential inhibitors of the p53–MDM2 interaction (Fig. 3A–C).
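Tanimoto coefficient-based screening of the kind mentioned above can be illustrated with a minimal sketch. It assumes that each molecule has already been encoded as a set of "on" fingerprint bits; the bit indices, compound names, and the 0.7 threshold are hypothetical and are not taken from the cited study.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def screen_by_similarity(query, library, threshold=0.7):
    """Return (name, score) pairs with score >= threshold, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints (hypothetical bit indices, not real molecules).
reference = {1, 4, 7, 9, 12, 15}          # stands in for the co-crystal ligand
library = {
    "cmpd_A": {1, 4, 7, 9, 12, 16},
    "cmpd_B": {2, 3, 5, 9},
    "cmpd_C": {1, 4, 7, 9, 12, 15},
}
hits = screen_by_similarity(reference, library, threshold=0.7)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

In practice the fingerprints would come from a cheminformatics toolkit, and the similarity cut-off is a study-specific choice.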

Fig. 3
Structures of A 3-trans-p-coumaroyl maslinic acid, B Silvestrol, C Betulonic acid, D ZINC00936598, E ZINC01020370, F ZINC00869973, G ZINC02578057, H ZINC05434006, I ZINC04260971, J ST055650, K ZINC03935485, L ZINC05013091, M ZINC35465964

Zarezade et al. [36] screened ~ 20,000 NPs from ZINC [39] to identify potential candidates for inhibition of the progesterone receptor (PR), which is one of the key targets in breast cancer. The X-ray crystal structure of PR was obtained from the PDB, and the dataset of NPs was virtually screened to select the top 200 compounds with the best docking scores. After ADMET screening of these molecules, 56 drug-like molecules were identified and submitted to the MTiOpenScreen [35] web server to identify the 10 compounds with the best estimated binding energy. These PR antagonists were then redocked using AutoDock v4.2 [40] and, based on the calculated binding energy, inhibitory constant (Ki), and the key residues involved in the interactions, three NPs were identified as inhibitors (ZINC00936598, ZINC01020370, and ZINC00869973) (Fig. 3D–F). Among these three compounds, ZINC00936598 had the most favourable binding energy and the lowest Ki. These compounds were then subjected to a 100-ns MD simulation to examine the stability of the protein–ligand complexes. H-bond analysis demonstrated that the lead compounds in complex with PR remained stable during the simulation period. Comparative H-bond analysis indicated that ZINC00936598 was more potent than the other two molecules, and it was identified as the best PR antagonist for breast cancer treatment.
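The link between a docking-derived binding energy and the estimated Ki can be made concrete with the standard thermodynamic relation Ki = exp(ΔG / RT), which is the form docking programs such as AutoDock report. The sketch below is illustrative only: the ligand names and energies are hypothetical, and the temperature of 298.15 K is an assumption.

```python
import math

R_KCAL = 1.98720e-3  # gas constant in kcal/(mol*K)


def ki_from_binding_energy(delta_g_kcal, temperature_k=298.15):
    """Estimated inhibition constant (mol/L) from a binding free energy in kcal/mol."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


# Hypothetical binding energies (kcal/mol); not values from the cited study.
candidates = {"ligand_1": -9.5, "ligand_2": -8.2, "ligand_3": -7.9}

# A more negative binding energy corresponds to a lower (better) Ki.
ranked = sorted(candidates, key=candidates.get)
for name in ranked:
    ki_micromolar = ki_from_binding_energy(candidates[name]) * 1e6
    print(f"{name}: dG = {candidates[name]:.1f} kcal/mol, Ki ~ {ki_micromolar:.3f} uM")
```

Because the relation is exponential, a difference of about 1.4 kcal/mol at room temperature changes the estimated Ki by roughly one order of magnitude.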

A primary reason for metastasis is the degradation of the basement membrane mediated by matrix metalloproteinase 9 (MMP9). Biswas et al. [41] used ~ 14,000 NPs from ZINC to identify potential intercalating agents that inhibit the interaction between tumour necrosis factor receptor-associated factor 6 (TRAF6) and Basigin (BSG) in order to control MMP9 overexpression. The BSG–TRAF6 complex was prepared using the ZDOCK [42] protocol, following which the natural compound dataset was virtually screened using the Genetic Optimisation for Ligand Docking (GOLD) software [43]. The top 20 compounds identified by GOLD score and ChemScore were then checked for their ADMET properties using various parameters, and the best 3 molecules were selected. These 3 molecules were thereafter redocked (blind docking) using PatchDock [44] to validate their interaction with the target complex. To further elucidate the effect of binding of these molecules and check the dynamic flexibility of the BSG–TRAF6 complex, a 60-ns MD simulation was performed. This led to the identification of ZINC02578057 (Fig. 3G) as the best inhibitor, and the authors therefore proposed this molecule as a potential inhibitor of the BSG–TRAF6 complex to control the overexpression of MMP9 in melanoma.

In another study, Jairajpuri et al. [17] performed VS of a set of ~ 33,000 ADMET-filtered NPs from the ZINC database against Sphingosine kinase 1 (SphK1). Ten hit compounds were identified based on their binding affinities to SphK1. They were then subjected to further docking analysis to examine all possible docked conformers of the complexes. Finally, two compounds, ZINC05434006 and ZINC04260971 (Fig. 3H, I), were proposed as inhibitors based on their interactions with the substrate-binding site of SphK1.

Raj et al. [32] studied flavonoid-based molecules to find novel inhibitors of the Bromodomain-containing protein 4 (BRD4). The active site was identified using literature evidence as well as SiteMap [45]. VS was performed for 500 flavonoids and 4000 extended flavonoids obtained from the TimTec database (https://www.timtec.net/), and several known inhibitors such as Ms435, Bromosporine, and CPI203 were also docked as references. A three-tier docking approach was taken using the HTVS, SP, and XP modes of Glide [46], where at each stage the top 10% of the compounds were taken further. The top-ranked flavonoid compound was ST055650 (Fig. 3J), which scored highly in both Glide and AutoDock docking analyses. Subsequently, the ADMET properties and drug-likeness of this molecule were checked, and a 50-ns MD simulation was performed to validate the stability of the ligand in complex with the target protein. All subsequent analyses of the MD trajectory and interactions highlighted ST055650 as a potential candidate for BRD4 inhibition.
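The tiered funnel described above, in which each stage keeps the top 10% of compounds, can be sketched as follows. This is not the Glide workflow itself: the scoring functions are random placeholders standing in for HTVS, SP, and XP docking, and all compound names and scores are hypothetical.

```python
import random


def keep_top_percent(scored, percent=10):
    """Keep the best-scoring share of compounds; a lower docking score is better."""
    n_keep = max(1, -(-len(scored) * percent // 100))  # integer ceiling division
    return sorted(scored, key=lambda item: item[1])[:n_keep]


def tiered_screen(compounds, stages, percent=10):
    """Apply each scoring stage to the survivors of the previous stage."""
    survivors = list(compounds)
    history = [len(survivors)]
    for score_fn in stages:
        scored = [(c, score_fn(c)) for c in survivors]
        survivors = [c for c, _ in keep_top_percent(scored, percent)]
        history.append(len(survivors))
    return survivors, history


# Placeholder scores: 4500 hypothetical compounds with random "docking scores".
rng = random.Random(0)
fake_scores = {f"cmpd_{i}": rng.uniform(-12.0, -2.0) for i in range(4500)}


def mock_stage(noise):
    """Stand-in for one docking precision level; less noise mimics a finer stage."""
    return lambda c: fake_scores[c] + rng.gauss(0.0, noise)


stages = [mock_stage(1.0), mock_stage(0.5), mock_stage(0.1)]
survivors, history = tiered_screen(fake_scores, stages)
print(history)
```

The design point is that the cheap, coarse stage sees the whole library while the expensive, precise stage only sees a small fraction of it.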

The double-mutated Epidermal Growth Factor Receptor (EGFR) is an important and well-known target in lung cancer that demonstrates resistance to existing drugs. Agarwal et al. [37] addressed this clinical problem by integrating ADMET, ML, VS, and MD to identify new NP inhibitors of the mutant protein. A total of 152,056 naturally occurring small molecules were retrieved from 12 different NP databases, and their drug-likeness was evaluated using Lipinski’s rule of 5 and the Ghose filter [47]. The 74,673 molecules that passed ADMET filtering were thereafter subjected to a random forest-based binary ML classification model (NPred) [20] to evaluate their anti-cancer potential. The 4681 potential anti-cancer molecules with an NPred score > 0.7 were then subjected to VS against the EGFR mutant crystal structure (PDB ID: 5EDQ) using FLEXX-PHARM [48]. Under the constrained docking approach, 1339 molecules were docked, and finally the 3 molecules with the lowest binding free energy were identified. 100-ns MD simulations of the top 3 ligands (ZINC03935485, ZINC05013091, ZINC35465964) (Fig. 3K–M) showed that all three inhibitors interact with Gln791 and Met793 in a conserved manner, similar to the co-crystal ligand. The study concluded with the identification of three naturally occurring molecules as potent inhibitors of the double-mutated EGFR protein.
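A drug-likeness pre-filter of the kind used in this study can be sketched with Lipinski's rule of 5 and the Ghose filter applied to precomputed descriptors. The thresholds below are the commonly quoted ones; the `Descriptors` class, the molecule names, and all descriptor values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one molecule (values supplied by the caller)."""
    name: str
    mol_weight: float          # g/mol
    logp: float                # octanol-water partition coefficient
    h_donors: int
    h_acceptors: int
    molar_refractivity: float
    n_atoms: int               # total atom count


def lipinski_violations(d):
    """Count violations of Lipinski's rule of 5."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_ghose(d):
    """Ghose filter ranges as commonly quoted (an assumption of this sketch)."""
    return (
        160 <= d.mol_weight <= 480
        and -0.4 <= d.logp <= 5.6
        and 40 <= d.molar_refractivity <= 130
        and 20 <= d.n_atoms <= 70
    )


def is_drug_like(d, max_violations=0):
    """Require both filters; max_violations is a tunable, study-specific choice."""
    return lipinski_violations(d) <= max_violations and passes_ghose(d)


# Hypothetical molecules; the numbers are invented for illustration.
library = [
    Descriptors("mol_ok", 320.3, 2.1, 2, 5, 85.0, 40),
    Descriptors("mol_heavy", 612.7, 6.3, 4, 11, 160.0, 88),
    Descriptors("mol_tiny", 120.1, 0.5, 1, 2, 30.0, 15),
]
drug_like = [d.name for d in library if is_drug_like(d)]
print(drug_like)
```

Molecules that survive such a filter would then move on to the classification and docking stages described above.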

Pharmacophore modelling

The pharmacophore modelling technique is important for core/scaffold hopping and hit or lead optimisation. Validated pharmacophore models are used to search compound databases containing a large number of structures for bioactive molecules. Commonly used pharmacophoric features include hydrogen bond acceptor (HBA), hydrogen bond donor (HBD), hydrophobic (HY), and ring aromatic (RA) features, as well as positive ionisable and negative ionisable groups. A key advantage of pharmacophore-based techniques is that they are effective both in the presence and in the absence of a receptor structure. By generating pharmacophores, it is possible to identify the specific interacting features of a molecule that are crucial for either eliciting or blocking a biological response [49]. Pharmacophores thus highlight the biological as well as the chemical features of a bioactive compound. Based on the type of molecule used to generate the pharmacophore, the process can be classified into two subtypes: ligand-based and structure-based pharmacophore modelling. In the ligand-based methods, knowledge of known active compounds is used for pharmacophore generation. The structure-based approach, on the other hand, requires a bound receptor–ligand complex to elucidate the underlying interactions of the binding pocket that can be used to generate the pharmacophores [50]. Details of a few studies published in this area in the last five years are compiled in Table 1. These studies highlight the use of pharmacophore-based approaches for the identification of phytomolecules with potential anti-cancer activities. A few of these case studies are discussed below.

Complex-based pharmacophore generation techniques were used by Babu et al. [51] to study natural compound-derived inhibitors against seven therapeutic targets of gastric cancer. Three flavonoids, 5-hydroxy-7,4′-dimethoxy-6,8-di-C-methylflavone, kaempferol-3-O-β-D-glucopyranoside (Fig. 4A), and kaempferol-3-O-α-L-rhamnopyranoside (Fig. 4B), were isolated from the fruits of the medicinal plant Syzygium alternifolium, and their anti-gastric cancer activity was checked against the AGS cell line. Molecular docking studies showed that Human Epidermal Growth Factor Receptor 2 (HER2) had the maximum affinity for these ligands. Thus, the docking poses of these 3 compounds in the active site of HER2 were utilised to generate pharmacophore models. The pharmacophore model generated for the best complex (selected based on cytotoxic profile and docking score) was used as a query against the ZINC database to identify compounds with 90% similarity to the best compound. After subsequent docking and MD studies, the three isolated compounds as well as ZINC67903192 (Fig. 4C) were identified as promising HER2 inhibitors against gastric cancer.

Fig. 4
Structures of A Kaempferol-3-O-β-D-glucopyranoside, B Kaempferol-3-O-α-L-rhamnopyranoside, C ZINC67903192, D Quercetin, E Curcumin, F ZINC85643856, G ZINC85646292, H ZINC96221218, I ZINC00241889, J ZINC11866307, K ZINC38143676, L SR84, M SR300, N SR413, O SR823, P SR530, Q ZINC20392430, R SN00112175, S SN00004718, T SN00262261

Bommu et al. [34] used ligand-based pharmacophore techniques to identify potential inhibitors of EGFR, which is an important target in non-small cell lung cancer (NSCLC). The authors used the plant-based NP quercetin (Fig. 4D) as a template for ligand-based VS and identified 100 analogs with pharmacological properties similar to those of quercetin. Molecular docking of the 100 quercetin analogs in the binding pocket of EGFR led to the identification of 10 molecules based on their binding affinities. Pharmacophore modelling of these 10 lead molecules was followed by the generation of a merged pharmacophore to identify the important features that contribute to the better binding efficiency of these molecules. Further analysis of drug-likeness and ADMET properties revealed that the molecules have acceptable ADMET profiles and can therefore be considered for drug development.

In a few studies, different types of pharmacophore models are used in conjunction, such as the common feature and receptor-based pharmacophore models generated by Rampogu et al. [16] for the DEAD box protein 3 (DDX3). DDX3 is overexpressed in a variety of cancers, including breast, colorectal, liver, lung, and oral cancers, as it plays an important role in cancer progression, proliferation, and transformation. VS with these two pharmacophore models against ADMET-screened compounds from InterBioScreen (https://www.ibscreen.com/) led to the identification of 17 compounds that were common to both models. Following molecular docking of these 17 compounds, curcumin (Fig. 4E) was identified as the best lead molecule against DDX3 based on binding affinity, key residue interactions, and MD simulation.

Singh et al. [52] performed pharmacophore-guided activity profiling of NPs from the ZINC database for five different receptor tyrosine kinase targets in lung cancer, namely EGFR, tyrosine-protein kinase Met (cMET), erb-b2 receptor tyrosine kinase 2 (ErbB2), Fibroblast Growth Factor Receptor (FGFR), and Anaplastic Lymphoma Kinase (ALK). For these five tyrosine kinases, an exhaustive literature search was performed to identify appropriate molecular datasets, which were then used to generate ligand-based pharmacophore models. The Catalyst module of Discovery Studio was used to create HypoGen-based 3D pharmacophores, which were verified using sensitivity, specificity, and the receiver operating characteristic (ROC) curve. The ZINC NP catalogue was then screened using the pharmacophore models of these 5 targets. Cross-selection and sorting were done based on fitness score, which resulted in the identification of 10 molecules against the respective target proteins. These 10 molecules were then docked to the crystal structures of the 5 target proteins using CDOCKER. Finally, MD simulations of eight NPs (ZINC85643856, ZINC85646292, ZINC96221218, ZINC00241889, ZINC98365505, ZINC98364461, ZINC11866307, ZINC38143676) (Fig. 4F–K) with different proteins were performed, and these ligands were found to show several interactions that are important for the inhibition of their respective protein targets.
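Validation by sensitivity, specificity, and the ROC curve can be sketched in a few lines. The example below assumes that each validation compound carries a known label (1 for active, 0 for inactive) and a pharmacophore fit score where higher means a better match; the labels, scores, and the 0.5 threshold are hypothetical.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when compounds scoring >= threshold are called hits."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank statistic.

    Equals the probability that a randomly chosen active scores higher than
    a randomly chosen inactive, with ties counted as one half.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if a > i else 0.5 if a == i else 0.0
        for a in actives
        for i in inactives
    )
    return wins / (len(actives) * len(inactives))


# Hypothetical validation set: 3 actives and 4 inactives with fit scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
auc = roc_auc(labels, scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc:.3f}")
```

An AUC near 1 indicates that the model ranks actives ahead of inactives, whereas a value near 0.5 is no better than random ranking.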

Alamri et al. [53] identified a potential novel sigma-2 inhibitor using pharmacophore and structure-based VS of a database of natural and natural-like compounds. The key structural features of known ligands of the sigma-2 receptor were used to develop the pharmacophore model. The target sigma-2 protein was modelled by homology modelling through the Iterative Threading ASSEmbly Refinement (I-TASSER) web server [54]. The ligand-based pharmacophore was created using 19 diverse ligands with high binding affinity, obtained from the Sigma-2 Receptor Selective Ligands Database (S2RSLDB) [55]. The pharmacophore model with the highest score was selected after quantitative validation with 100 actives and 425 decoys. The database of compounds was then screened by docking-based VS using Glide XP, and the top 20 compounds were selected based on the binding energy score. Five lead compounds (SR84, SR300, SR413, SR823, SR530) (Fig. 4L–P) were subsequently selected and compared with the reference compound, CM366. Thereafter, ADMET profiling and a 50-ns MD simulation were undertaken to validate the five compounds as inhibitors of the sigma-2 protein.

The cyclin-dependent kinase 7 (CDK7) is one of the most important members of the CDK family, which regulates cell cycle events. Overexpression of CDK7 has been reported in several cancers, including hepatocellular, gastric, oral, breast, ovarian, pancreatic, and colorectal cancer, and has been linked with aggressive clinicopathological features and poor prognosis. Kumar et al. [56] therefore implemented a ligand-based pharmacophore approach in which a small set of selective inhibitors downloaded from PubChem [57] was used as the training set for hypothesis creation. These molecules were used to generate ten pharmacophore models with the HipHop algorithm. A structure-based pharmacophore model was also generated using the CDK7 structure bound to THZ1 (PDB ID: 6XD3). Both the ligand- and structure-based pharmacophore models were validated using the ROC curve and the Güner–Henry (GH) approach. Thereafter, 57,578 ADMET-filtered NPs from four databases (ZINC, SuperNatural II [58], ExiMed [https://eximedlab.com/], and InterBioScreen) were screened using the validated pharmacophore models. A total of 197 compounds were selected by both pharmacophore models and subjected to molecular docking using GOLD. Of these, 24 potential inhibitors displayed better binding scores than the two reference molecules (CT7001 and THZ1) and were thus taken forward for 50-ns MD simulation studies. The MD results were analysed to select four NPs (ZINC20392430, SN00112175, SN00004718, and SN00262261) (Fig. 4Q–T) that demonstrated better binding affinity than known inhibitors as well as increased stability in their docked poses.
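The Güner–Henry approach scores a pharmacophore model by how well it retrieves known actives from a decoy-containing database. A minimal sketch using the commonly quoted GH formula is given below; the function name and the example counts are hypothetical and are not taken from the cited study.

```python
def gh_metrics(total_compounds, total_actives, total_hits, active_hits):
    """Enrichment factor and Guner-Henry (GH) score for a screening run.

    total_compounds (D): size of the validation database
    total_actives (A):   known actives in the database
    total_hits (Ht):     compounds retrieved by the model
    active_hits (Ha):    known actives among the retrieved compounds
    """
    d, a, ht, ha = total_compounds, total_actives, total_hits, active_hits
    yield_of_actives = ha / ht          # fraction of hits that are active
    recall_of_actives = ha / a          # fraction of actives that were retrieved
    enrichment = yield_of_actives / (a / d)
    # GH combines yield and recall, then penalises retrieved inactives.
    gh = (ha * (3 * a + ht)) / (4 * ht * a) * (1 - (ht - ha) / (d - a))
    return {
        "yield": yield_of_actives,
        "recall": recall_of_actives,
        "enrichment_factor": enrichment,
        "gh_score": gh,
    }


# Hypothetical validation run: 1000 compounds, 50 actives, 60 hits, 45 true actives.
metrics = gh_metrics(total_compounds=1000, total_actives=50,
                     total_hits=60, active_hits=45)
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```

The GH score ranges from 0 to 1, and a model that retrieves exactly the set of actives and nothing else scores 1.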

QSAR

The QSAR technique is a ligand-based drug design method that involves building mathematical models to find statistically significant correlations between chemical structures and biological properties such as the half-maximal inhibitory concentration (IC50), half-maximal effective concentration (EC50), and Ki [59]. Many advances have been made in the field of QSAR in the last decade, which have greatly increased the dimensionality of molecular descriptors (from 1D to nD) [60] and improved the extraction of correlations between chemical structures and biological properties. QSAR-based techniques have been used to identify potential lead compounds for a wide variety of cancers. A few examples of how QSAR-based techniques have been used to identify plant-based NPs against different cancers are listed in Table 1, and some are discussed in detail below.

Alam and Khan [61] developed field-based 3D-QSAR models on the MCF7 breast cancer cell line to identify the anti-cancer effects of the natural triterpene maslinic acid and its analogs. Important features of the active compounds, such as average shape, hydrophobic regions, and electrostatic patterns, were extracted and mapped to virtually screen potential analogs. Field point-based descriptors were used to generate the 3D-QSAR model by aligning known active compounds onto the identified pharmacophore templates. Ultimately, the compound P-902 (Fig. 5A) was identified as the best candidate against different breast cancer targets such as Aldo-Keto Reductase family 1 member B10 (AKR1B10), Nuclear Receptor subfamily 3 group C member 1 (NR3C1), Prostaglandin-endoperoxide Synthase 2 (PTGS2), and HER2.

Fig. 5
Structures of A P-902, B Lonchocarpin, C CID_301751, D CID_3372729, E T9, F B42, G GA-1, H Cardenolide

Lung cancer is the leading cause of cancer-related deaths worldwide and has been the focus of drug discovery research. Using 3D-QSAR, docking, flow cytometry, and gene expression, Chen et al. [62] studied the anti-cancer activity of the natural chalcone lonchocarpin (Fig. 5B). The authors found hydrophobic interactions to be the most influential factor for the anti-tumour efficacy of lonchocarpin. Subsequent molecular docking studies also showed that lonchocarpin binds stably to the BH3-binding groove of the anti-apoptotic protein BCL2. Using several in silico techniques including QSAR, it was demonstrated that lonchocarpin is a potentially useful natural agent for cancer treatment.

The withanolides, a group of natural compounds present in the roots and leaves of Withania somnifera (Indian ginseng/Ashwagandha), were explored by Yadav et al. [63] for developing QSAR models. The models utilised information about the anti-proliferative activity of withanolide analogs against two human breast cancer cell lines (SK-Br-3 and MCF7/BUS). The model for the SK-Br-3 cell line showed a 93% correlation between activity and chemical descriptors in the training set, while the model for the MCF7/BUS cell line showed a 91% correlation. The cross-validation coefficients indicated 90% and 85% prediction accuracy for the two models, respectively. The two highly active compounds identified (CID_301751 and CID_3372729) (Fig. 5C, D) showed higher anti-proliferative activity than the reference compounds 5-fluorouracil (5-FU) and camptothecin (CPT). These compounds were therefore docked to the biological target β-tubulin using the Surflex-Dock [64] module of SYBYL-X 2.1, and they showed favourable binding affinity, bioavailability, and ADME properties.
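The two statistics reported for such models, the training-set correlation and the cross-validation coefficient, can be illustrated with a deliberately simple sketch: a single-descriptor least-squares model with a leave-one-out estimate. This is not the multi-descriptor model of the cited study; the descriptor and activity values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept for one descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def loo_q_squared(x, y):
    """Leave-one-out cross-validated q2: each point is predicted by a model
    fitted on all the other points."""
    predictions = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(slope * x[i] + intercept)
    return r_squared(y, predictions)


# Hypothetical data: one descriptor value and one activity value per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
activity = [4.1, 4.9, 6.2, 6.8, 8.1, 8.9]

slope, intercept = fit_line(descriptor, activity)
fitted = [slope * xi + intercept for xi in descriptor]
r2 = r_squared(activity, fitted)
q2 = loo_q_squared(descriptor, activity)
print(f"r2 = {r2:.3f}, q2 = {q2:.3f}")
```

The cross-validated q2 is never higher than the training r2 for a least-squares model, which is why a large gap between the two is usually read as a sign of overfitting.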

Ursolic acid (UA) is a natural pentacyclic terpene that has promising anti-cancer properties. Thus, Yadav et al. [19] developed 3D-QSAR models from 29 UA derivatives that can inhibit the T24 bladder cancer cell line. The 3D-QSAR-based CoMFA models were developed using SYBYL-X 2.1 and used for the prediction of the bioavailability and bioactivity of different compounds. The screened compounds were subjected to pharmacokinetic evaluation equivalent to the standard anti-cancer drug doxorubicin. In order to understand the underlying molecular mechanism of action, molecular docking was also performed. The two predicted active compounds, T9 and B42 (Fig. 5E, F), satisfied all standard screening protocols such as Lipinski’s rule of 5 [65] and PK/PD and toxicity parameters. Thus, T9 and B42 were proposed as promising leads against bladder cancer.

The bioactive triterpenoid glycyrrhizic acid (GA) can be obtained from the Indian medicinal herb Glycyrrhiza glabra. Shukla et al. [66] performed a QSAR study to observe the biological effects of GA and its derivatives against metastatic Triple-Negative Breast Cancer (TNBC) cell lines. Using a regression-based QSAR model, five novel GA derivatives were designed, synthesised and screened for in vitro activity in metastatic breast cancer cell line MDA-MB-231. The results highlighted novel derivative GA-1 (Fig. 5G) having a cytotoxic activity greater than that of its parent compound GA. Further, atomic property field (APF)-based 3D QSAR and subsequent molecular docking studies revealed that the C-30 carboxylic group of the novel derivative GA-1 was the most important factor for its anti-cancer activity.

Cardenolides (Fig. 5H) are a class of cardiac glycosides that have long been used as potent inhibitors of the Na+/K+-ATPase transmembrane protein, which is overexpressed in a variety of cancers such as skin, kidney, and lung cancer. Meneses-Sagrero et al. [67] therefore developed SAR/QSAR models using 58 cardenolides with anti-proliferative effects on the lung cancer cell line A549. The structures of these 58 molecules were generated using ChemBioDraw Ultra, and molecular descriptors were calculated using PaDEL [68]. Using linear regression, the dimensionality of the physicochemical descriptors was reduced, and ultimately 62 descriptors were used to generate the mathematical models. The QSAR models were thereafter validated using random cross-validation, and models with R² > 0.7 were taken forward for external validation. The authors concluded that the addition of a sugar moiety at the C3 position of cardenolides enhances their anti-proliferative effect on lung cancer cells.

ADMET screening

In recent years, several in silico techniques have been developed that allow rapid assessment of molecules for their ADMET properties. Through these techniques, drug efficacy and toxicity are monitored by taking into consideration the pharmacokinetics and pharmacodynamics of potential drug-like molecules [80]. It is important to take into consideration the ADMET properties of the molecules under investigation a priori so that late-stage failures in clinical trials may be avoided. Therefore, the prediction of ADMET properties through in silico/computational techniques has become the cornerstone for “good” drug discovery practices. Some databases are also available that provide the ADMET processed compounds directly for VS [5, 7, 8]. Studies exist that analyse the pharmacokinetic properties of plant-derived NPs [6] in diseases other than cancer and therefore have not been discussed in this review. The studies discussed below are all focused on the ADMET analyses of plant-based molecules with anti-cancer effects.

In silico prediction of the ADMET properties of molecules has become significant because these methods can lower the probability of late-stage drug development failure and restrict wet lab experimentation to only a few promising molecules. Several resources (databases and software) are available for computing and estimating different ADMET properties of molecules, such as QikProp, pkCSM [81], and DataWarrior [82]. Fatima et al. [8] utilised these tools to evaluate the pharmacokinetic potential of ~ 3000 phytomolecules from geographically diverse databases such as Phytochemica [83], SerpentinaDB [84], SANCDB [85], and NuBBEDB [86]. After ADMET profiling of these compounds, their anti-cancer potential was evaluated using literature-based experimental evidence, and their activity against different cancer cell lines and protein targets was documented. Finally, 24 compounds with the best ADMET behaviour were identified, all of which belonged to the NuBBEDB, Phytochemica, and SANCDB databases. Additionally, a user-friendly database, ADMET-BIS, was created in which users can find details about the ADMET behaviour of these phytochemicals from Brazil, India, and South Africa.

To identify drug-like anti-cancer compounds from the medicinal plants used in African Traditional Medicine (ATM), Ntie-Kang et al. [7] examined the pharmacokinetic properties of ~ 400 anti-cancer compounds discovered from African flora (AfroCancer) and compared them with ~ 1500 anti-cancer compounds from the Naturally Occurring Plant-based Anti-cancer Compound-Activity-Target (NPACT) database [14]. Applying Lipinski’s Rule of 5 [65], which is used to assess the physicochemical properties of drug-like molecules, showed that approximately 200 compounds in the AfroCancer dataset had no Lipinski violations. A comparison of the ADMET properties of these datasets based on the star parameter of QikProp revealed that around 232 and 630 compounds of AfroCancer and NPACT, respectively, fall within the range accepted for 95% of known drugs. Overall, the study demonstrates the pharmacokinetic potential of African anti-cancer plant molecules that can be taken forward for further experimental validation.
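As an illustration of the Lipinski filter mentioned above, the following minimal sketch counts Rule of 5 violations from precomputed descriptors. The thresholds are the standard Rule of 5 cut-offs; the `Descriptors` class, the function name, and the two example compounds with their descriptor values are hypothetical and are not taken from the AfroCancer or NPACT datasets.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed physicochemical descriptors for one molecule (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int      # number of hydrogen-bond donors
    h_bond_acceptors: int   # number of hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Return how many of the four Rule of 5 criteria the molecule exceeds."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


# Made-up descriptor values, for demonstration only.
compound_a = Descriptors(mol_weight=302.2, logp=1.5, h_bond_donors=5, h_bond_acceptors=7)
compound_b = Descriptors(mol_weight=822.9, logp=2.1, h_bond_donors=8, h_bond_acceptors=16)

for name, desc in [("compound_a", compound_a), ("compound_b", compound_b)]:
    print(name, "violations:", lipinski_violations(desc))
```

In a screening workflow, compounds with zero violations would be the ones counted as fully Lipinski-compliant; the descriptors themselves would come from a tool such as QikProp rather than being typed in by hand.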

Sharma et al. [5] collected ~ 5000 anti-cancer phytomolecules from three open-access databases, NPACT, CancerHSP [87], and TaxKB [88], which were then checked for their ADMET properties. ADMET profiling was done using QikProp with 13 parameters (namely Lipinski's Rule of 5, Jorgensen's Rule of 3, polar surface area, number of rotatable bonds, octanol–water partition coefficient [logP], affinity to plasma protein [logKHSA], number of likely metabolic reactions, number of reactive functional groups, number of amides, number of amines, number of acids, IC50 for blockage of hERG K+ channels [loghERG] and transport across the blood–brain barrier [logBB]). It was observed that 63% of the tested compounds were orally absorbable, 52% were distributable, 45% could be metabolised and excreted, and 28% were found to be non-toxic with respect to cardiotoxicity and central nervous system (CNS) activity. The authors have developed an interactive database, ADMETCan, which can be used to access the ADMET profiles of the studied compounds. This resource provides information about plant-based molecules that have a higher chance of being properly eliminated from the body.

Selaginella repanda is an important ethnomedicinal plant used by certain Indian tribal communities for treating a wide variety of diseases and health conditions. Adnan et al. [89] therefore assessed the anti-cancer properties of the S. repanda phytoconstituents. The crude extract of S. repanda was tested against the MCF7, HCT116 and A549 cell lines, where it exhibited significant anti-proliferative activity against all three cell lines. Subsequently, high-resolution liquid chromatography–mass spectrometry (HR-LC-MS) was performed on this crude extract, and 54 phytochemicals from different classes such as fatty acids, alcohols, sugars, flavonoids, alkaloids, terpenoids, coumarins, and phenolics were identified. The ADMET properties of these molecules were computed using SwissADME, and all of them were found to obey Lipinski’s Rule of 5. Additionally, most of these molecules exhibited good bioavailability scores, moderate to good skin permeability and moderate to potent Caco-2 permeability. This allowed the identification of phytoconstituents of S. repanda that have anti-proliferative activity against different cancers and satisfy the ADMET criteria.

Next-generation drug discovery techniques

Alongside the traditional chemoinformatics-based approaches described earlier, the newer generation of drug discovery pipelines is utilising concepts like network pharmacology and ML in the drug discovery process [21, 90, 91]. In recent years, AI has progressed from mostly theoretical studies to real-world applications, including different stages of drug research and development. Here we focus on how ML and network biology techniques have been utilised for the identification of NCEs and potential lead-like compounds from plant-based natural sources.

Network pharmacology-based approaches

The traditional approaches to drug development continue to provide reliable results, but they tend to follow the ‘one drug one target’ approach. Because diseases like cancer are multifactorial, this limitation has led to the development of the interdisciplinary field of network pharmacology, which lies at the intersection of network biology and polypharmacology [92]. Network pharmacology follows systems-based thinking, wherein a complex network of biological pathways is broken into smaller, more understandable pieces that can be studied individually. It relies on the concept of finding multi-targeted drugs that act on multiple biological targets and pathways, thus yielding better therapeutic results [92]. Examples of studies that demonstrate how network pharmacology is being used to identify important phytomolecules against different cancer targets are given in Table 2. A few of them are discussed in detail below.

Table 2 Studies that utilise Network pharmacology approaches for anti-cancer drug discovery

Shang et al. [90] conducted a study on tetrandrine (Fig. 6A), a bis-benzylisoquinoline alkaloid extracted from Stephania tetrandra, a plant used in traditional Chinese medicine, to predict its molecular mechanisms against endometrial cancer. Seven key target genes were identified using protein–protein interaction (PPI) network analysis and were then found to be distributed in the PI3K/Akt signalling pathway using the Kyoto Encyclopedia of Genes and Genomes (KEGG) [93]. Molecular docking and in vitro assays revealed that tetrandrine acted as a tumour suppressor by repressing BCL2 and promoting the activity of BCL2 associated X, apoptosis regulator (BAX) at the mRNA level. In most cancers, aberrant apoptotic signalling, notably the inactivation of apoptotic systems, enables cancer cells to evade programmed cell death, resulting in uncontrolled proliferation, tumour survival, treatment resistance, and cancer recurrence. This study therefore established that tetrandrine induces apoptosis in endometrial cancer cells through BCL2 family proteins.

Fig. 6

Structures of A tetrandrine, B isoquercitrin, C quercetin, D berberine, E chlorogenic acid, F caffeic acid, G emodin, H apigenin, I kaempferol, J linoleic acid, K methyl linoleic acid, L ferulic acid, M ursolic acid, N enterolactone, O parthenolide, P β-bourbonene

Fruits of Nandina domestica have been used as a traditional remedy for treating cancer in different countries. Taha et al. [70] analysed the extracts of this plant using Ultra-high performance liquid chromatography-MS/MS (UPLC-MS/MS). Subsequently, compound-target and compound-target-pathway networks were derived from the Search Tool for Interactions of Chemicals (STITCH) [94], Database for Annotation, Visualisation, and Integrated Discovery (DAVID) [95], KEGG, and Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [96]. The computed data were then verified via in vitro experiments. Enrichment analyses, performed on the 22 compounds that passed UPLC-MS/MS, revealed the presence of 5 anti-cancer compounds (isoquercitrin, quercetin, berberine, chlorogenic acid, and caffeic acid) (Fig. 6B-F), and 4 molecular targets (AKT serine/threonine kinase 1 [AKT1], Caspase 3 [CASP3], Mitogen-Activated Protein Kinase 1 [MAPK1], and Tumour Protein p53 [TP53]). The authors further analysed the data to identify 15 cancer-related pathways for colorectal, endometrial, and non-small cell lung cancers. The results of the current study revealed that the above-mentioned five phytocompounds exhibit high synergistic interactions with important cancer-related targets and pathways.

The NP emodin (Fig. 6G), extracted from the plant Rheum palmatum, has traditionally been used to treat cancer of the lungs, liver, and pancreas. Zhang et al. therefore studied the therapeutic effects of this NP on MCF7 cancer cells [97]. The targets of the compound were identified using PharmMapper [98], SwissTargetPrediction [99], and the Traditional Chinese Medicine Systems Pharmacology database and analysis platform (TCMSP) [100]. The authors then used this information to extract genomic information from the UniProt Knowledgebase. A PPI network of the protein targets was thereafter generated using the STRING database, after which Gene Ontology (GO) and KEGG pathway enrichment analyses were performed and visualised using the R software. The study showed that emodin exhibits anti-tumour activity by activating the AhR/CYP1A1 pathway and can be further exploited as a potential lead compound for the treatment of breast cancer.

Deng et al. [101] used the active components of Platycodon grandiflorum (PG) roots to generate a “Drug-Ingredients-Targets-Pathways-Disease” (DITPD) network in order to explore the potential molecular mechanism of PG against different cancers. Various triterpenoid saponins, steroidal saponins, flavonoids, phenolic acids, organic acids, and other compounds were identified from PG and then screened for ADMET properties. Apigenin, caffeic acid, kaempferol, linoleic acid, methyl linoleic acid, and ferulic acid (Fig. 6F, H–L) were identified as core compounds. The targets of these active components were obtained using the SwissTargetPrediction, PharmMapper, and TCMSP databases. Previously known disease targets for lung cancer were obtained, and a PPI network was generated for these targets using STRING. Degree centrality (DC) was then calculated for this network using CentiScaPe, and the 20 targets with the highest DC values were retained. Subsequent enrichment analysis revealed that these targets are closely related to different cancer pathways such as the TNF, MAPK and PI3K-AKT signalling pathways. The top 10 annotated pathways were then used to construct a DITPD network in Cytoscape [102]. A network of 40 nodes (1 drug, 8 core components, 20 core targets, 10 pathways and 1 disease) was obtained, in which the targets of the core ingredients of PG were found to be distributed across these pathways. Finally, a molecular docking study verified the interactions between the core components of PG (apigenin, caffeic acid, kaempferol, linoleic acid, methyl linoleic acid and ferulic acid) and the target molecules, indicating a consistent relationship between the NPs and the targets.
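The degree-centrality filtering step described above can be sketched as follows. This is a simplified stand-in for what a tool like CentiScaPe computes, assuming DC is the raw number of distinct neighbours of a node in an undirected PPI network; the function names and the toy edge list with placeholder target names are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Map each node to its number of distinct neighbours in an undirected network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}


def top_targets(edges, k):
    """Return the k nodes with the highest degree (ties broken alphabetically)."""
    dc = degree_centrality(edges)
    return sorted(dc, key=lambda node: (-dc[node], node))[:k]


# Toy PPI edges between placeholder targets; not real interaction data.
toy_edges = [
    ("T1", "T2"),
    ("T1", "T3"),
    ("T1", "T4"),
    ("T2", "T4"),
    ("T3", "T5"),
]

print(degree_centrality(toy_edges))
print(top_targets(toy_edges, k=2))
```

With a real STRING export, `toy_edges` would be replaced by the interaction pairs above a chosen confidence cut-off, and `k` would be set to 20 to mirror the filtering described in the text.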

Breast cancer is the leading cause of cancer-related death in women worldwide; therefore, Jha et al. [79] identified ADMET-adherent phytomolecules against six of the most important breast cancer targets, viz. PR (PDB ID: 4OAR), EGFR (PDB ID: 2J6M), mTOR (PDB ID: 4DRH), p53R2 (PDB ID: 3HF1), CTLA4 (PDB ID: 1DQT), and CDK8 (PDB ID: 6T41). The authors used Dr. Duke's Phytochemical and Ethnobotanical Database (https://phytochem.nal.usda.gov/phytochem/search) to acquire a list of 68 phytochemicals with established anti-cancer activities, which were then evaluated for drug-likeness using SwissADME [103] and ADMETlab 2.0 [104]. The CASTp server [105] was used to identify the active sites of each protein, after which 38 of the ADMET-filtered ligands were docked to these proteins using AutoDock Vina [106]. Molecular docking revealed that ursolic acid, enterolactone, parthenolide, and berberine (Fig. 6D, M–O) show higher binding affinity than the known drugs and may therefore be used to inhibit these breast cancer targets.

Among the members of the genus Ficus, Ficus carica L. is widely recognised for the medicinal value of its leaves and roots. Gurung et al. [78] performed an extensive literature review and identified 68 bioactive compounds present in the leaves, roots and bark of F. carica. The physicochemical features of these NPs were checked in DataWarrior v4.6.1, which indicated that 13 of the 68 NPs were safe for use. Thereafter, the 3D structures of six important protein targets of various cancers, viz. CDK2 (PDB ID: 1DI8), CDK6 (PDB ID: 1XO2), Topoisomerase I (PDB ID: 1T8I), Topoisomerase II (PDB ID: 1ZXM), BCL2 (PDB ID: 2O2F), and VEGFR2 (PDB ID: 2OH4), were downloaded from the PDB, prepared, and used for molecular docking with AutoDock v4.2. For each of the six proteins, the best-bound NPs were studied using MD simulations in LARMD [107]. This ultimately identified β-bourbonene (Fig. 6P), which exhibited strong binding with most of the cancer targets.

Because cervical cancer is a major concern for developing nations, Aarthy et al. [108] adopted a systems pharmacology-based multi-omics approach to explore the curative potential of plant-based NPs against Human Papilloma Virus (HPV)-mediated cervical cancer. Firstly, the authors used the ArrayExpress database [109] (https://www.ebi.ac.uk/arrayexpress/) to retrieve human transcriptome datasets of cervical cancer patients. After manual curation in Microsoft Excel, the data were imported into NetworkAnalyst 3.0 [110] (https://www.networkanalyst.ca/), wherein differential gene expression analysis was performed and 384 immune-related genes were identified using the Limma statistical model with an adjusted p value < 0.05. NetworkAnalyst was also used to perform over-representation analysis and functional enrichment, and to identify tissue-specific interactions. Secondly, 87 pharmacologically active constituents of the Indian plants Mangifera indica, Nigella sativa, Zingiber officinale, Citrus grandis, Ziziphus jujube, Ziziphus mauritiana and Cinnamomum cassia were identified from literature and web resources. The immune response-related protein targets of the 87 NPs were then retrieved from SwissTargetPrediction (http://swisstargetprediction.ch/index.php), which showed that 79 of the 87 NPs significantly targeted 35 of the 384 differentially expressed genes (DEGs). Thirdly, compound-target networks (CTNs) were constructed for these 79 NPs and 35 DEGs using Cytoscape v3.8.0 to identify the key multi-target properties of the identified phytocompounds. Additionally, gene ontology, enrichment and molecular interactome analyses were performed using GOnet [111] (https://tools.dice-database.org/GOnet/), Metascape (https://metascape.org/gp/index.html#/main/step1) and STRING Viruses v10.5 [112] (http://viruses.string-db.org/). This integrative approach thus highlighted the multi-target action of plant-based NPs from the Indian subcontinent that are capable of tackling HPV-mediated cervical cancer.
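To make the "adjusted p value < 0.05" filter concrete, the sketch below applies a multiple-testing correction to a list of raw p values and keeps the genes that pass. The text does not state which adjustment was used, so the Benjamini-Hochberg procedure is an assumption here (it is a common choice for differential expression analysis); the function names, gene labels and p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(p_by_gene, alpha=0.05):
    """Return the genes whose adjusted p value is below alpha."""
    genes = list(p_by_gene)
    adjusted = benjamini_hochberg([p_by_gene[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q < alpha]


# Made-up raw p values for placeholder genes.
toy_p_values = {"gene_1": 0.01, "gene_2": 0.04, "gene_3": 0.03, "gene_4": 0.5}
print(significant_genes(toy_p_values))
```

Note that gene_2 and gene_3 have raw p values below 0.05 but do not survive the correction in this toy example, which is exactly the effect an adjusted threshold is meant to have.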

Machine learning-based approaches and NP applications

ML techniques and algorithms are now extensively used for different aspects of the drug discovery and development process [128]. Large amounts of data and different algorithms are used to train ML models so that they can learn to perform tasks without being explicitly programmed. Depending on the learning process, ML algorithms are broadly categorised as either supervised or unsupervised. Supervised learning models are trained on both input and output data: the categorical or continuous relationships in labelled datasets are utilised for training. The expected outcome of a model trained through supervised learning is the prediction of outputs for new input data. In unsupervised learning, by contrast, no labelled data are given, which forces the model to identify the intrinsic patterns in the input data and, on that basis, to group the data into meaningful clusters independently (without supervision) [129]. Models trained on a large number of chemical compounds allow efficient and cost-effective modules to be developed. These models use ML algorithms to uncover the underlying relationships between experimental observations and the chemical features of molecules, and are then used to predict the experimental outcome from the chemical information of new molecules [128]. The traditional approach to drug discovery is expensive and time-consuming. Owing to this, the use of ML tools in this field is steadily gaining momentum for various purposes such as mining relevant chemical information, predicting biological attributes and potentially active biological molecules [130], identifying novel targets [131], extracting evidence for target-disease associations [132], improving the understanding of disease and non-disease phenotypes [133], and developing better biomarkers for prognosis, progression and drug efficacy [134].

In this section, we discuss only those studies in which ML algorithms have been applied to promote NP-based anti-cancer drug discovery and which also involve the development of ML-based applications. A brief overview of ML-based web servers and software and how they have been used in anti-cancer drug discovery is given below (Table 3).

Table 3 A brief overview of the usage of ML applications in NP-based drug discovery

CASE I

As NPs are one of the most diverse sources of lead compounds for drug discovery, Chen et al. [21] developed a method for calculating the NP-likeness of compounds to guide the identification of new lead compounds for NP drug discovery. The authors generated a reference NP dataset composed of 201,761 unique molecules from 18 virtual and 9 physical NP libraries. For the small molecule (SM) set, an equal number of 201,761 compounds was randomly selected from the ZINC database, with any molecule present in the NP dataset removed from the SM dataset. The NP and SM datasets were then merged to give a total of 403,522 compounds, which were randomly split into a training set and a test set in the ratio 4:1. From these compounds, three different descriptor sets (Morgan2 fingerprints, MACCS keys, and 206 two-dimensional physicochemical property descriptors) were generated for training random forest classifiers. To measure how accurately the models were able to rank the NPs, their performance was evaluated using the Matthews correlation coefficient and the area under the receiver operating characteristic curve. After validation, the model based on MACCS keys was selected for use as it had demonstrated superior classification ability on the independent test set. The ability of this model to identify NPs was also tested on an external validation set drawn from the Dictionary of Natural Products (DNP), which showed that 95% of the NPs were correctly identified. Finally, a web application named “NP-Scout” was developed, which can be accessed at https://nerdd.univie.ac.at/npscout/.
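Two steps of this workflow, the random 4:1 split and the Matthews correlation coefficient (MCC), are simple enough to sketch in plain Python. The MCC formula is the standard one computed from confusion-matrix counts; the function names, the seed, and the example counts are hypothetical, and the sketch does not reproduce the fingerprinting or random forest training used by the authors.

```python
import math
import random


def split_4_to_1(items, seed=0):
    """Shuffle the items and return (training, test) subsets in a 4:1 ratio."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when any marginal total is zero
    return (tp * tn - fp * fn) / denominator


train, test = split_4_to_1(range(1000))
print(len(train), len(test))

# Made-up confusion-matrix counts for an NP-vs-SM classifier.
print(round(matthews_corrcoef(tp=90, tn=85, fp=15, fn=10), 3))
```

MCC ranges from -1 (complete disagreement) through 0 (no better than chance) to +1 (perfect classification), which makes it a useful single-number summary for a two-class problem such as NP versus SM.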

CASE II

Traditional approaches for predicting the activity of unknown molecules are based on the structure–activity relationship (SAR). Deviating from this approach, Yue et al. [91] developed an ML-based approach that uses both gene expression data and the chemical properties of NPs. To develop the NP response predictors, data for 17 NP drugs tested across a number of cell lines were retrieved from GDSC (Genomics of Drug Sensitivity in Cancer) [145]. These 17 NPs were screened across an average of 495 cell lines per drug, and the sensitivity (IC50) values for all NPs were used to classify the cell lines into three groups (Sensitive, Resistant and Intermediate) using the K-Means clustering algorithm in the Waikato Environment for Knowledge Analysis (WEKA) [146]. The samples in the sensitive and resistant groups were used to build different ML models using decision tree, support vector machine, random forest, and rotation forest algorithms. For building the ML models, thirteen NPs with 6450 cancer cell line-NP interactions were randomly selected for the training set, and the remaining four NPs with 1970 cancer cell line-NP interactions were used as the test set. Thereafter, the performance of these models was evaluated to identify the algorithm best suited to predicting the sensitivity of cancer cell lines to the NPs. All four methods gave good results in tenfold cross-validation on the training set. The authors also adopted a second method of evaluation, wherein the CancerHSP database was searched for anti-cancer herbs used in systems pharmacology and NP-related studies. Two anti-tumour NPs, curcumin and resveratrol, were selected for evaluating the predicted cancer cell sensitivity to NPs. For curcumin, six out of seven cell lines were correctly predicted by the model, whereas for resveratrol, five out of eight were correctly predicted. Thus, this study demonstrates that ML models trained on cancer cell line data can help identify NPs with potential anti-cancer activity.
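The clustering step in this case study, grouping cell lines into Sensitive, Intermediate and Resistant by their IC50 values, can be sketched with a one-dimensional k-means. This is a plain-Python illustration rather than the WEKA implementation the authors used; the quantile-based initialisation, the function names, and the toy log IC50 values are assumptions made for the example.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Cluster scalar values with Lloyd's algorithm; returns the k centroids."""
    ordered = sorted(values)
    # Assumption: initialise centroids at evenly spaced quantiles of the data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster happens to be empty.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def label_cell_lines(log_ic50_by_line):
    """Assign each cell line to the group whose centroid is closest to its log IC50."""
    centroids = sorted(kmeans_1d(list(log_ic50_by_line.values())))
    names = ["Sensitive", "Intermediate", "Resistant"]  # low IC50 means sensitive
    labels = {}
    for line, value in log_ic50_by_line.items():
        nearest = min(range(3), key=lambda i: abs(value - centroids[i]))
        labels[line] = names[nearest]
    return labels


# Made-up log IC50 values for placeholder cell lines.
toy_ic50 = {"line_a": -2.0, "line_b": -1.8, "line_c": 0.1,
            "line_d": 0.3, "line_e": 2.9, "line_f": 3.1}
print(label_cell_lines(toy_ic50))
```

Only the Sensitive and Resistant groups would then be carried forward as the two classes for supervised training, as described in the text.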

CASE III

Rayan et al. [147] developed a predictive model for identifying NPs with anti-cancer activity from a set of 617 approved anti-cancer drugs and 2892 NPs. The 617 anti-cancer drugs, retrieved from the CMC (Comprehensive Medicinal Chemistry) database and the NCI Drug Dictionary, were used as actives, whereas the 2892 phytochemicals, retrieved from AnalytiCon Discovery, were used as the inactive set. Molecular Operating Environment (MOE) was used to calculate the 1D and 2D physicochemical properties (descriptors) of the molecules in both datasets. The molecules were then split into a training set comprising 66.7% of the dataset and a test set comprising the remaining molecules. Thereafter, the iterative stochastic elimination (ISE) algorithm was used to build a predictive model capable of indexing NPs for potential anti-cancer activity based on the distinction between active and inactive ligands. From the initial NP dataset, twelve NPs were highly ranked as potential anti-cancer agents by the ISE model, and a subsequent literature search showed that three of them, i.e. neoechinulin, colchicine, and piperolactam (Fig. 7A–C), have established experimental evidence as anti-cancer agents, which supports the validity of this model.

Fig. 7

Structures of A neoechinulin, B colchicine, and C piperolactam as identified by the works of Rayan et al., 2017

CASE IV

One of the earliest applications of ML in the field of NP drug discovery was reported by Ertl et al. [148], who used a Bayesian approach to quantify how closely a given molecule resembles the structural space occupied by NPs. The DNP was taken as the set of NPs, and a set of 290,000 commercially available synthetic molecules was used as the synthetic molecule (SM) set. To construct the NP-likeness score, the authors considered structural elements such as particular substructures that are characteristically present in NPs. Substructural fragments were generated for both the NP and SM sets, and the distribution of the fragments between the two groups was analysed. After development, the model was cross-validated using standard statistical measures such as the area under the curve as well as enrichment plots. Additional testing of the NP-likeness scorer on a set of novel molecules that were not present in the starting NP set showed that the system correctly identified ~ 93% of the molecules as NPs. Therefore, this tool can be a part of standard VS exercises along with other screening parameters.
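The idea of scoring a molecule by how its fragments are distributed between an NP set and an SM set can be sketched as a sum of log-odds weights. This is only an illustration of the general fragment-based Bayesian idea, not the published scoring function: the add-one smoothing, the absence of any size normalisation, the function names, and the toy fragment labels are all assumptions. Each molecule is represented simply as a set of fragment labels.

```python
import math
from collections import Counter


def fragment_weights(np_molecules, sm_molecules):
    """Log-odds weight per fragment: positive if more frequent in NPs than in SMs."""
    np_counts = Counter(f for mol in np_molecules for f in set(mol))
    sm_counts = Counter(f for mol in sm_molecules for f in set(mol))
    n_np, n_sm = len(np_molecules), len(sm_molecules)
    weights = {}
    for frag in set(np_counts) | set(sm_counts):
        # Add-one smoothing (an assumption) avoids division by zero for unseen fragments.
        p_np = (np_counts[frag] + 1) / (n_np + 2)
        p_sm = (sm_counts[frag] + 1) / (n_sm + 2)
        weights[frag] = math.log(p_np / p_sm)
    return weights


def np_likeness(molecule, weights):
    """Sum the weights of the fragments in a molecule; unknown fragments count as 0."""
    return sum(weights.get(f, 0.0) for f in set(molecule))


# Toy fragment sets with placeholder labels; not derived from real structures.
toy_np_set = [{"lactone", "glycoside"}, {"lactone", "phenol"}]
toy_sm_set = [{"sulfonamide", "phenol"}, {"sulfonamide", "amide"}]

weights = fragment_weights(toy_np_set, toy_sm_set)
print(np_likeness({"lactone", "phenol"}, weights))
print(np_likeness({"sulfonamide", "amide"}, weights))
```

A positive score indicates that the molecule is built mostly from fragments that are enriched in the NP set, and a negative score indicates the opposite; in practice the fragments would be generated by a chemoinformatics toolkit rather than written by hand.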

In addition to the above, the same research group [23] has also developed a Natural-Product-Likeness scoring system that can be downloaded as a standalone Java package from https://sourceforge.net/projects/np-likeness/. It uses Taverna version 2.2 and a few other open-source Java libraries to calculate the NP-likeness of compounds. The model has been trained using NPs sourced from open-access databases, ChEMBL [149], and the Traditional Chinese Medicine Database (TCMD) [150]. The tool assigns a signature to each query molecule, which is then used for “Signature Scoring” and for calculating the NP-likeness of the compound.

The NaPLeS NP-likeness scorer is a web-based application developed by Sorokina et al. [22] to calculate the NP-likeness of compounds. This open-source application has been developed using a training set of 364,807 unique NPs and 489,780 unique synthetic compounds. The molecular structures and corresponding scores are stored in a MySQL 5.8 Docker image. A unique identifier is allotted to every sourced molecule, and the NP-likeness score is computed on the parameters of heavy and total atom counts, the number of rings, the number of repeated fragments, and the number of predominant heavy atoms. The authors have also developed a web application using the Spring Boot framework, which is available for public access at http://naples.naturalproducts.net.

CASE V

Researchers have also generated classification models of the relationship between the chemical structures of plant-based NPs and their anti-cancer activity using QSAR-based ML algorithms such as the Naive Bayesian classifier (NB), sequential minimal optimisation (SMO), instance-based learner (IBK) and random forest (RF) [20]. Using these algorithms and 881 PubChem fingerprints, the authors developed different classification models to predict the inhibitory potential of plant compounds taken from the NPACT database. The random forest algorithm with 100 trees was found to be the best among the classifiers, achieving 81.58% sensitivity, 72.44% specificity, and 77.6% accuracy. A frequency-based feature selection method further revealed that the top 10 fingerprints selected by this study were also present in a wide range of natural anti-cancer drugs such as vincristine, vinblastine, and paclitaxel.
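For readers less familiar with the three performance measures quoted above, the following sketch shows how sensitivity, specificity and accuracy are derived from the counts of a binary confusion matrix. The definitions are standard; the function name and the example counts are hypothetical and are not the counts behind the reported percentages.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of true actives predicted as active
    specificity = tn / (tn + fp)   # fraction of true inactives predicted as inactive
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Made-up counts for a classifier evaluated on 200 compounds.
sens, spec, acc = classification_metrics(tp=80, tn=70, fp=30, fn=20)
print(f"sensitivity={sens:.2%} specificity={spec:.2%} accuracy={acc:.2%}")
```

Reporting all three together is useful because accuracy alone can hide an imbalance between how well actives and inactives are recognised.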

Existing phytomolecule databases and web resources

NP-based drug discovery is an extensive field of study, in which phytomolecule-based anti-cancer drug discovery is a niche domain. There exists a vast amount of literature and experimental data regarding plant-based NPs that show potential activity against different diseases including cancer. These data are a treasure trove of unique scaffolds and compounds that have been accumulated and represented in the form of different phytomolecule databases as given in Table 4. Most of these databases focus on phytomolecules and their therapeutic potential. There exists a large number of resources that provide extensive lists of NPs used for drug discovery purposes [2, 9, 26, 28, 151, 152], but in accordance with the scope of the review, this section briefly discusses those resources that specifically cater to the information and use of phytomolecules in drug discovery.

Table 4 Existing phytomolecule resources

The phytomolecule databases/resources described in the current review (Table 4) have been classified based on geographical location, compound class, traditional medicine system and bioactivity (Fig. 8). In the geographical location category, there are databases that contain the phytomolecules native to specific regions of the world such as India [83, 84, 153], Brazil [86], Africa [7], Cameroon [152] and Taiwan [154]. Within the Indian subcontinent, there are highly specific databases such as Phytochemica [83] and MedPServer [153] that contain information on the phytomolecules found in plants of the Himalayan bioresource and the North-Eastern region of the country (Fig. 8). Closely related to this group is the traditional medicine system category, since traditional medicine systems are based on the easily and widely available flora of their native regions. For example, the TM-MC [155] database contains information about the phytomolecules found in the traditional plant-based medicines of Japan, Korea, and China, while a large number of resources are available for Traditional Chinese Medicine (TCM). Within TCM, there is also a database that houses the plant-based NPs used by the Chinese ethnic minority communities (Fig. 8). Since phytomolecules are highly diverse, they can also be classified based on their structural similarities. Accordingly, a number of databases have been built that cater to phytomolecules of specific classes such as carotenoids [156], triterpenes [157], and polyphenols [158] (Fig. 8). Finally, there are databases that report the bioactivity of different phytomolecules against different cancers at the cell line and protein levels, e.g. NPACT (Fig. 8). All these databases can be evaluated before selecting and curating a library for CADD against any cancer target. The selection of an appropriate library is an important step in CADD, as it determines the type of molecules that are studied as well as the nature of the molecules selected or identified at the end of the study.

Fig. 8

Different phytomolecule resources and their classification

We hope that the aggregated information provided in this section will serve as an important resource for anyone working in the field of phytomolecule-based anti-cancer drug discovery.

Gaps and future prospects

The advent of computer-aided techniques has transformed the entire landscape of drug discovery research. This review has discussed the use of plant-derived NPs in anti-cancer drug discovery and highlights that some of the traditional techniques are used more than the newer techniques. For example, ML-based techniques were found to be extensively used in both anti-cancer research and NP-based research separately, but limited evidence exists for their use in NP-derived anti-cancer drug discovery. A reason for this may be that few ML-based models have been developed for predicting the anti-cancer activity of NPs, and only some of these are publicly available. This is an area of opportunity for future research: more effort needs to be put into developing publicly available ML-based tools and resources for drug discovery. In this review, we have also highlighted the various plant-derived NP-based tools and databases that are important for CADD. However, there is still a gap in the resources available for NP-derived anti-cancer drug discovery, with NPACT and AfroCancer being the only two databases that cater to this specific area. Thus, there is a need for more structured and systematised resources that can accommodate the complexity and diversity of the data required for NP drug discovery. The current landscape of plant-derived NP resources has the apparent disadvantage of being either very specific or very general in nature. Consequently, the diversity of different plant species is somewhat lost, and even the most comprehensive resources have no method to demarcate those NPs that are specifically important for anti-cancer drug discovery. Therefore, this review highlights several areas of plant-based anti-cancer drug discovery that need more attention from the scientific community worldwide.

Conclusion

Overall, the current review discusses in detail the niche area of anti-cancer drug discovery from plant-based NPs. Efforts have been made to describe current practices such as molecular docking and dynamics, VS, pharmacophore modelling, and QSAR that are widely used in the drug discovery process. Additionally, we have highlighted and discussed in detail more recent techniques such as ML and network pharmacology that are not yet as widely used as the traditional techniques. We hope this review will help the scientific community not only to understand the tools and techniques used in the traditional drug discovery process but also to learn and implement the next-generation techniques.