Introduction

Over the past two decades, developing efficient and advanced systems for the targeted delivery of therapeutic agents with maximum efficiency and minimum risk has posed a great challenge to chemical and biological scientists [1]. Further, the cost and time required to develop novel therapeutic agents have been another setback in the drug design and development process [2]. To minimize these challenges and hurdles, researchers around the globe moved toward computational approaches such as virtual screening (VS) and molecular docking, which are now regarded as traditional approaches. However, these techniques also suffer from limitations such as inaccuracy and inefficiency [3]. Thus, there is a surge in the implementation of novel techniques that can overcome the challenges encountered in traditional computational approaches. Artificial intelligence (AI), including machine learning (ML) and deep learning (DL) algorithms, has emerged as a possible solution to the problems and hurdles in the drug design and discovery process [4]. Additionally, drug discovery and design comprise long and complex steps such as target selection and validation, therapeutic screening and lead compound optimization, pre-clinical and clinical trials, and manufacturing practices. All these steps pose another massive challenge in the identification of effective medication against a disease. Thus, the biggest question facing pharmaceutical companies is how to manage the cost and speed of the process [5]. AI has begun to answer these questions in a simple and scientific manner, reducing the time consumption and cost of the process. Moreover, the increase in data digitization in the pharmaceutical and healthcare sectors motivates the implementation of AI to overcome the problems of scrutinizing complex data [6].

AI, which is also referred to as machine intelligence, is the ability of computer systems to learn from input or past data. The term AI is commonly used when a machine mimics cognitive behavior associated with the human brain during learning and problem solving [7]. Nowadays, biological and chemical scientists extensively incorporate AI algorithms in the drug design and discovery process [8]. Computational modeling based on AI and ML principles provides a great avenue for identification and validation of chemical compounds, target identification, peptide synthesis, evaluation of drug toxicity and physicochemical properties, drug monitoring, drug efficacy and effectiveness, and drug repositioning [9]. With the advent of AI principles along with ML and DL algorithms, VS of compounds from chemical libraries, which comprise more than 10^6 million compounds, has become easy and time-effective. Further, AI models help reduce the toxicity problems that arise due to off-target interactions [10]. Herein, we briefly discuss the evolution of AI from ML to DL and the involvement of big data in revolutionizing the drug discovery process. We then present an overview of the congregation of AI and conventional chemistry and of the application of AI in improving the traditional drug discovery process. Afterward, we discuss the numerous AI applications throughout the drug design and discovery processes, such as primary and secondary screening, drug toxicity, drug release and monitoring, drug dosage effectiveness and efficacy, drug repositioning, polypharmacology, and drug-target interactions.

Evolution of artificial intelligence: machine learning to deep learning

In September 2015, Google search trends showed that, after the introduction of ML, AI was the most searched term. Some describe ML as the primary AI application, while others describe it as a subset of AI [11, 12]. AI is an umbrella term for computer programs that are able to think and behave as humans do, whereas ML goes a step further: data are fed to the machine along with an algorithm such as Naïve Bayes, decision tree (DT), or hidden Markov model (HMM), which helps the machine to learn without being explicitly programmed. Later, with the development of neural networks, machines could classify and organize input data in a manner that mimics the human brain, which marked a further advancement in AI. Around 2000, Igor Aizenberg and his colleagues, while discussing artificial neural networks (ANN), used the term “deep learning” in that context for the first time. DL is a subset of ML, which itself is a subset of AI, and thus the hierarchy runs AI > ML > DL [13, 14]. ML uses either supervised learning, where the model is trained on labeled data, meaning that each input has been tagged with the corresponding preferred output label, or unsupervised learning, where the model is trained on unlabeled data and looks for recurring patterns in the input [15]. Other paradigms include semi-supervised learning, which combines supervised and unsupervised learning; self-supervised learning, a special case that uses a two-step process in which unsupervised learning generates labels for unlabeled data with the ultimate goal of building a supervised learning model; reinforcement learning, which improves its behavior over time with the help of a constant feedback loop; and DL, a brain-inspired family of algorithms built from many layers, which requires high computational power for training and big data to succeed [16, 17].
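The difference between supervised and unsupervised learning described above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the one-dimensional toy data, the "active"/"inactive" labels, and the helper names `fit_centroids`, `predict`, and `two_means`. The supervised part is a nearest-centroid classifier that uses the labels; the unsupervised part is a two-cluster k-means that sees only the values.

```python
from statistics import mean

# Hypothetical toy data: one made-up descriptor value per compound, with a label.
labeled = [(0.9, "inactive"), (1.1, "inactive"), (1.0, "inactive"),
           (4.8, "active"), (5.2, "active"), (5.0, "active")]


def fit_centroids(samples):
    """Supervised step: use the labels to compute one mean value per class."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(values, iterations=10):
    """Unsupervised step: 1-D k-means with k=2; no labels are used.

    Assumes the values are not all identical (otherwise one cluster is empty).
    """
    lo, hi = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = mean(near_lo), mean(near_hi)
    return sorted(near_lo), sorted(near_hi)


centroids = fit_centroids(labeled)                # learned from labeled data
prediction = predict(centroids, 4.5)              # label for an unseen value
clusters = two_means([x for x, _ in labeled])     # groups found without labels
print(prediction, clusters)
```

In this toy case the two clusters found without labels coincide with the labeled classes, but the clustering step has no notion of which group is "active"; that meaning only enters through the labels used in the supervised step.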
The origin of ML dates back to 1943, when McCulloch and Pitts published an article entitled “A logical calculus of the ideas immanent in nervous activity,” in which they gave the first-ever mathematical model of a neural network [18]. Alan M. Turing theorized the concept of ML in his seminal paper published in 1950 [19]. In 1952, Arthur L. Samuel popularized the term “machine learning” by writing a checkers-playing program for IBM [20]. In 1957, Frank Rosenblatt developed the perceptron, which was built for image recognition [21]. Henry J. Kelley developed the continuous backpropagation model in 1960, and a simpler version based only on the chain rule was developed by Stuart Dreyfus in 1962 [22, 23]. In 1965, Ivakhnenko and Lapa developed the first working DL networks. Around 1980, Kunihiko Fukushima developed an ANN called the neocognitron, whose multilayered design could help a computer learn to recognize visual patterns [24]. He also developed the first convolutional neural network (CNN), which was based on the organization of the visual cortex found in animals [25] [Fig. 1].

Fig. 1

a History of artificial intelligence in healthcare: the first breakthrough of artificial intelligence in healthcare came in 1950 with the development of the Turing test. Later on, in 1975, the first research resource on computers in medicine was developed, followed by the NIH's first central AIM workshop, which marked the importance of artificial intelligence in healthcare. With the development of deep learning in the 2000s and the introduction of DeepQA in 2007, the scope of artificial intelligence in healthcare increased. Further, in 2010 CAD was applied to endoscopy for the first time, whereas in 2015 the first Pharmbot was developed. In 2017, the first FDA-approved cloud-based DL application was introduced, which also marked the implementation of artificial intelligence in healthcare. From 2018 to 2020, several AI trials in gastroenterology were performed. b Classification of artificial intelligence: there are seven classifications of artificial intelligence, which are reasoning and problem solving, knowledge representation, planning and social intelligence, perception, machine learning, robotics (motion and manipulation), and natural language processing, as discussed by Russell and Norvig in their book “Artificial Intelligence: A Modern Approach.” Machine learning is further divided into three significant subsets: supervised learning, unsupervised learning, and deep learning, whereas vision is divided into two subsets: image recognition and machine vision. Similarly, speech is divided into two subsets: speech to text and text to speech, whereas natural language processing is classified into five main subsets: classification, machine translation, question answering, text generation, and content extraction. c Artificial intelligence in the healthcare and pharmaceutical industry has five significant applications, which change the entire scenario.
These applications include research and discovery, clinical development, manufacturing and supply chain, patient surveillance, and post-market surveillance.

David Rumelhart, Geoffrey Hinton, and Ronald J. Williams published a paper entitled “Learning Representations by Back-propagating Errors” in 1986, which demonstrated that backpropagation could provide an improvement in shape recognition and word prediction [26]. After the initial success, there were some setbacks, but Hinton kept working through the second AI winter to achieve new heights. For this reason, he is often regarded as the “Godfather of DL.” In 1989, Yann LeCun gave the first practical demonstration of backpropagation at Bell Labs [27]. In the same year, Christopher Watkins published his thesis entitled “Learning from Delayed Rewards,” which introduced the concept of Q-learning and further improved reinforcement learning in computer programs [28]. In 1995, Corinna Cortes and Vladimir Vapnik developed support vector machines (SVM) to map and recognize similar data [29]. Two years later, in 1997, Jürgen Schmidhuber and Sepp Hochreiter developed long short-term memory (LSTM) for recurrent neural networks [30].

In 1999, the graphics processing unit (GPU) was launched as a microprocessor circuit, developed initially to accelerate 3D graphics processing for computer gaming. Later on, GPUs became popular in technology and research as well because of their capacity for parallel computing. A research report presented by META Group in 2001 stated that the volume, speed, sources, and types of data were increasing, which was a call to prepare for the onslaught of big data. In 2007, Nvidia introduced compute unified device architecture (CUDA), a framework that allowed programmers and researchers to use GPUs for general-purpose computing [31]. Since then, with the help of CUDA, researchers started using GPUs for DL-driven operations, as the high memory bandwidth of GPUs allowed easy handling of the massive data involved in DL algorithms, and the thousands of cores in GPUs allowed simultaneous parallel processing of neural networks. In 2009, Fei-Fei Li launched ImageNet, a free database containing millions of labeled images that can be used for research purposes [32]. AlexNet, a convolutional neural network created by Alex Krizhevsky and colleagues around 2012, used rectified linear units to speed up training and dropout to reduce overfitting [33]. In the same year, “the cat experiment” conducted by Google Brain found that the network correctly recognized less than 16% of the presented objects [34]. In 2014, Nvidia introduced the CUDA deep neural network library (cuDNN), a CUDA-based DL library that accelerated DL-based operations [35]. Similarly, “DeepFace” was developed and released in 2014 to identify faces with 97.5% accuracy [36]. In the same year, generative adversarial networks (GANs) were introduced, in which two neural networks compete: one generates data, and the other judges whether the data are genuine or generated [37]. In 2016, Cray Inc. used Microsoft’s neural network software on its XC50 supercomputer with 1000 Nvidia Tesla P100 GPUs, which could perform the task and give output in a fraction of a second.
In 2017, Nvidia introduced the Tesla V100 GPU, which had tensor cores that accelerated AI-based operations. However, DL is still in its growth phase, and creative ideas are required for further advancement in this field.

Revolutionizing drug discovery process: role of big data and artificial intelligence

Big data can be defined as data sets that are too large and intricate to be analyzed with conventional data analysis software, tools, and techniques. The three main characteristic features of big data are volume, velocity, and variety, where volume represents the huge amount of data generated, velocity represents the rate at which these data are being generated, and variety represents the heterogeneity present in the data sets [38]. With the advent of microarray, RNA-seq, and high-throughput sequencing (HTS) technologies, a plethora of biomedical data is being generated every day, due to which contemporary drug discovery has made a transition into the big data era. In drug discovery, the first and foremost step is the identification of appropriate targets (e.g., genes, proteins) involved in disease pathophysiology, followed by finding suitable drugs or drug-like molecules that can modulate these targets, and we now have access to a constellation of biomedical data repositories that can help in this regard [39]. Moreover, the evolution of AI has made big data analytics considerably easier, as a myriad of ML techniques is now available to help extract useful features, patterns, and structures present in these big biomedical data sets [40]. For target identification, gene expression is a widely used feature for understanding disease mechanisms and finding genes responsible for the disease. Microarray and RNA-seq technologies have generated a large amount of gene expression data for various disorders. NCBI Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/) [41], The Cancer Genome Atlas (TCGA) (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) [42], and ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) [43] are some of the big repositories that contain gene expression data. By analyzing gene expression signatures, we can identify target genes responsible for different disorders.
For example, using an ML approach and gene expression data, van IJzendoorn et al. 2019 identified novel biomarkers and potential drug targets for rare soft tissue sarcomas [44].
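The idea of reading candidate targets off a gene expression signature can be sketched as follows. This is only an illustrative sketch, not the method of any study cited above: the gene names, expression values, pseudocount, and threshold are all made up, and the helpers `log2_fold_change` and `candidate_genes` are hypothetical. It ranks genes by the log2 fold change between disease and control samples; a real analysis would add normalization, a statistical test, and multiple-testing correction.

```python
from math import log2
from statistics import mean

# Hypothetical expression values (arbitrary units); gene names are placeholders.
expression = {
    "GENE_A": {"disease": [80.0, 96.0, 112.0], "control": [10.0, 12.0, 14.0]},
    "GENE_B": {"disease": [20.0, 22.0, 18.0], "control": [19.0, 21.0, 20.0]},
    "GENE_C": {"disease": [4.0, 5.0, 6.0], "control": [38.0, 40.0, 42.0]},
}


def log2_fold_change(disease, control, pseudocount=1.0):
    """log2 ratio of mean disease to mean control expression.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return log2((mean(disease) + pseudocount) / (mean(control) + pseudocount))


def candidate_genes(data, threshold=1.0):
    """Return (gene, log2FC) pairs with |log2FC| >= threshold, largest change first."""
    changes = {
        gene: log2_fold_change(groups["disease"], groups["control"])
        for gene, groups in data.items()
    }
    selected = [(g, fc) for g, fc in changes.items() if abs(fc) >= threshold]
    return sorted(selected, key=lambda pair: abs(pair[1]), reverse=True)


ranked = candidate_genes(expression)
for gene, fc in ranked:
    direction = "up" if fc > 0 else "down"
    print(f"{gene}: log2FC = {fc:.2f} ({direction}-regulated in disease)")
```

Genes that pass the threshold in either direction form the signature; in the workflows described above, such a list would be the input to an ML model or to downstream target validation rather than a final answer.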

Further, genome-wide association studies (GWAS) can determine the association of genomic variants with particular complex disorders [45]. GWAS Central (https://www.gwascentral.org/) [46] and the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/home) [47] are some of the repositories that contain GWAS data. With the help of GWAS, we can ascertain disease-associated genetic loci, and it has been observed that genes linked with these loci are potential therapeutic targets. For instance, Li et al. [48] used the GWAS Catalog, gene expression, epigenomics, and methylation data to determine target genes associated with juvenile idiopathic arthritis loci through ML analysis. In addition, specific genes whose mutations can lead to different life-threatening diseases are also promising therapeutic targets. These risk genes can be identified by analyzing various genome and exome sequencing data. For sequencing data, we have public repositories like the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) [49], which contains sequencing data obtained from next-generation sequencing technology. The National Cancer Institute Genomic Data Commons (NCIGDC) (https://gdc.cancer.gov/) [50] and TCGA are data repositories that contain sequencing data related to cancer. Moreover, taking advantage of big data and AI, Han et al. 2019 developed DriverML (https://github.com/HelloYiHan/DriverML), a supervised ML-based tool that can point out driver genes related to cancer [51] [Fig. 2].

Fig. 2

Application of big data for drug designing and discovery: with the increase in biological and chemical data from the literature, in vitro, in vivo, and clinical studies, genomics, proteomics, and metabolomics studies, gene ontology studies, and molecular pathway data, different data repositories have been developed. For instance, ChemSpider, ChEMBL, ZINC, BindingDB, and PubChem are essential databases for compound synthesis and screening in the drug design and discovery process. The data stored in these databases were curated and screened for the pharmacological and physicochemical properties of compounds necessary for the drug discovery process, rather than for quantum mechanical (QM) quantities such as solvation energy, proton affinity, the wave function, atomic forces, and transition states. The high-throughput screened data were filtered based on drug-likeness, PAINS calculation, ADMET analysis, and toxicity. The filtered compounds were passed to artificial intelligence models such as deep learning, random forest, classification and regression, and neural networks for further analysis. These compounds were then subjected to quantitative structure-activity relationship and pharmacophore models, followed by molecular docking and molecular dynamics simulation studies. Afterward, the final predicted compounds were visualized for binding energy calculations and active site identification. Thus, the final compound was identified and underwent in vitro and in vivo experimental studies for validation. Although QM properties play a crucial role in drug discovery and design, they do not directly hamper the drug design process. QM methods include ab initio, density functional theory, and semi-empirical calculations, where accurate calculations use electron correlation methods. QM will become a more prominent tool in the repertoire of the computational medicinal chemist.
Therefore, modern QM approaches will play a more direct role in informing and streamlining the drug discovery process.
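The drug-likeness filtering step mentioned in the caption above can be illustrated with a small sketch. The caption does not say which drug-likeness criteria were used; the sketch assumes Lipinski's rule of five (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) with at most one violation tolerated, which is one common choice. The compounds and their descriptor values are hypothetical, and the descriptors are taken as precomputed inputs rather than calculated from structures.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one compound (values below are made up)."""
    name: str
    mol_weight: float       # g/mol
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four criteria the compound violates."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed policy: tolerate at most one violation."""
    return rule_of_five_violations(d) <= max_violations


# Hypothetical screening library.
library = [
    Descriptors("cmpd_1", 320.4, 2.1, 2, 5),   # no violations
    Descriptors("cmpd_2", 612.7, 5.8, 3, 9),   # weight and logP violations
    Descriptors("cmpd_3", 540.2, 3.9, 4, 8),   # weight violation only
]

kept = [d.name for d in library if is_drug_like(d)]
print(kept)
```

In a pipeline like the one in the figure, compounds surviving this kind of filter would then go through PAINS, ADMET, and toxicity checks before being passed to the AI models.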

Moreover, published literature itself can sometimes be used for target identification; PubMed (https://pubmed.ncbi.nlm.nih.gov/) [52] is a major repository of published biomedical literature, and mining its data can help in identifying targets for different disorders. After an appropriate target has been identified and validated, the next step is to find suitable drugs and/or drug-like molecules that can interact with the target and elicit the desired response [53]. In the age of big data, a multitude of large chemical databases is at our disposal, which can help in finding suitable drugs for a specific target. For instance, PubChem (https://pubchem.ncbi.nlm.nih.gov/) [54] is a freely accessible chemical database that contains data on various chemical structures, including their biological, physical, chemical, and toxicity properties [55]. Further, the ChEMBL database (https://www.ebi.ac.uk/chembl/) [56] is a large open-access database containing data on numerous bioactive compounds exhibiting drug-like properties [57]. The ChEMBL database also contains information on the absorption, distribution, metabolism, and excretion (ADME) and toxicity properties of these compounds, and even their target interactions. Further, DrugBank (https://go.drugbank.com/) [58] is another open-access pharmaceutical data repository, which contains data on various drugs, their targets, and their mechanisms [59]. Additionally, the Library of Integrated Network-Based Cellular Signatures (LINCS) L1000 (https://lincsproject.org/LINCS/) [60] is another repository that contains information on the change in gene expression signatures of human cell lines when treated with different chemical compounds. The LINCS L1000 data-driven search engine, known as L1000CDS2, is an open-access search engine that contains data on drugs that can revert the expression of differentially expressed genes; hence, it too can be used for drug discovery [61].
Further, the Protein Data Bank (PDB) (https://www.rcsb.org/) [62] is another freely accessible online repository that contains three-dimensional structures of proteins, DNA, and RNA [63]. PDB data are also widely used to assess protein–ligand interactions and then find appropriate inhibitors of a target protein. Xu et al. [64] combined ML and molecular docking to find inhibitors of COVID 3CL proteinase; here, the crystal structure of COVID 3CL proteinase was obtained from PDB.

Congregation of artificial intelligence and conventional chemistry: improves drug discovery

In the pharmaceutical industry, AI has emerged as a possible solution to the problems raised by classical chemistry or chemical space, which hamper drug discovery and development. With the advancements in technologies and the development of high-performance computers, the use of AI algorithms, from ML to DL, has increased in computer-aided drug design (CADD). AI is not a new technique for scientists in drug discovery and development, nor is chemists' desire to accurately forecast chemical structure-activity relationships. For example, Hammett related reaction rates to equilibrium constants, whereas Hansch performed computer-assisted prediction of drug compounds' physicochemical properties and biological activity. The success of Hansch opened an avenue for research focused on (a) detailed identification and prediction of the chemical structure along with the characterization of properties such as pharmacophores and three-dimensional structure and (b) formulating complex mathematical equations that relate the chemical representation to the biological activity of the predicted compound. However, scientists' main aim in the current era is to improve the drug discovery and development process with high accuracy and confidence scores through ML algorithms based on classical chemistry activities. This will encourage chemists to identify the potential of AI techniques for answering two crucial questions of medicinal chemistry: “what should be the next compound?” and “what is the process of making a compound?”. Thus, the last two decades have seen the development of many techniques and tools for computational drug discovery, quantitative structure-activity relationship (QSAR) methods, and free-energy minimization techniques. For example, the authors of [65] distinguished compound cell activity using machine intelligence methods such as DT, random forest (RF), CNN, SVM, LSTM network, and gradient boosting machine.
In some of these models, the compounds were expressed as strings in the simplified molecular input line entry system (SMILES) and used directly as input data in place of chemical descriptors, so that the task was treated as a natural language processing problem. They used two different cutoffs, one for the single data set (Z-score = 3) and one for the whole data set (Z-score = 5 or 6). They then evaluated the models with nine different metrics, including precision, accuracy, the area under the curve, and Cohen's kappa. The results demonstrated that the gradient boosting machine performs well when the data distribution is balanced. The experiment's outcomes also indicated that both classical ML methods and DL methods could classify compound cell activity [65]. Similarly, the authors of [66] predicted the PAMPA effective permeability using a two-QSAR approach, in which they developed a classical QSAR model using a partial least square (PLS) scheme and an ML-based QSAR model using a hierarchical support vector regression (HSVR) scheme. The authors concluded that the HSVR scheme performed better than the PLS scheme in the training set, test set, and statistical analysis [66]. Further, for the synthesis of new compounds, chemical scientists have long depended on published literature. With advancements in automated drug discovery methods involving AI and ML, it is relatively simple to distinguish between existing drugs and novel chemical structures. For example, the authors of [67] applied a computational approach to screen the hepatotoxic ingredients in traditional Chinese medicines, whereas the authors of [68] demonstrated the phylogenetic relationship, structure–toxicity relationship, and herb-ingredient network using computational techniques. Recently, Zhang et al. implemented computational analysis against a novel coronavirus, where the authors screened different compounds that were biologically active against severe acute respiratory syndrome (SARS). Later on, the compounds were subjected to ADME and docking analysis.
The results suggested that 13 existing traditional Chinese medicines were potentially effective against the novel coronavirus [69]. Thus, conventional chemistry-oriented drug discovery and development concepts combined with computational drug design provide a great platform for future research. Moreover, systems biologists and chemical scientists worldwide, in coordination with computational scientists, are developing modern ML algorithms and principles to enhance drug discovery and development.
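Three of the evaluation metrics named above (accuracy, precision, and Cohen's kappa) can be computed directly from a binary confusion matrix. The sketch below is generic and is not taken from the cited study; the label vectors are made up, with 1 standing for an active compound and 0 for an inactive one. Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 1 (active) / 0 (inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp, fn, tn):
    return tp / (tp + fp) if (tp + fp) else 0.0


def cohen_kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement: both say "active" plus both say "inactive".
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: only one class present in both vectors
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up labels for ten compounds.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
prec = precision(*counts)
kappa = cohen_kappa(*counts)
print(f"accuracy={acc:.3f} precision={prec:.3f} kappa={kappa:.3f}")
```

Kappa comes out lower than accuracy here because part of the raw agreement is attributable to chance, which is why it is often reported alongside accuracy when the active and inactive classes are unevenly represented.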

Transforming traditional computational drug design through artificial intelligence and machine learning techniques

For many years, computational methods have played an essential role in drug design and discovery and have transformed the whole process of drug design. However, many issues, such as time cost, computational cost, and reliability, are still associated with traditional computational methods [70, 71]. AI has the potential to remove these bottlenecks in the area of computational drug design, and it can also enhance the role of computational methods in drug development. Moreover, with the advent of ML-based tools, it has become relatively easier to determine the three-dimensional structure of a target protein, which is a critical step in drug discovery, as novel drugs are designed based on the three-dimensional ligand binding environment of a protein [72, 73]. Recently, Google’s DeepMind (https://github.com/deepmind) has devised an AI-based tool trained on PDB structural data, referred to as AlphaFold, which can predict the 3D structure of proteins from their amino acid sequences [74]. AlphaFold predicts 3D structures of proteins in two steps: (i) first, using a CNN, it transforms the amino acid sequence of a protein into a distance matrix as well as a torsion angle matrix; (ii) second, using a gradient optimization technique, it translates these two matrices into the three-dimensional structure of the protein [75]. Likewise, Mohammed AlQuraishi from Harvard Medical School has also designed a DL-based tool that takes a protein’s amino acid sequence as input and generates its three-dimensional structure. This model, referred to as the Recurrent Geometric Network (https://github.com/aqlaboratory/rgn), uses a single neural network to compute the bond angles and the angles of rotation of the chemical bonds connecting different amino acids in order to predict the three-dimensional structure of a given protein [76].
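The second step described above, turning a predicted distance matrix into three-dimensional coordinates by gradient optimization, can be illustrated with a toy sketch. This is not AlphaFold's implementation: it ignores torsion angles, uses plain gradient descent on a simple squared-error objective, and takes its "predicted" distance matrix from a made-up five-point trace. All names (`stress`, `refine`, `random_coords`) and parameter values are assumptions of the sketch.

```python
import math
import random


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def stress(coords, target):
    """Sum over pairs of (current distance - target distance)^2."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def random_coords(n, seed=0):
    """Deterministic random starting structure."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]


def refine(coords, target, steps=3000, lr=0.01):
    """Gradient descent on the stress; returns new coordinates, input is untouched."""
    coords = [list(p) for p in coords]
    n = len(coords)
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                r = math.dist(coords[i], coords[j]) + 1e-9  # avoid division by zero
                scale = 2.0 * (r - target[i][j]) / r
                for k in range(3):
                    grads[i][k] += scale * (coords[i][k] - coords[j][k])
        for i in range(n):
            for k in range(3):
                coords[i][k] -= lr * grads[i][k]
    return coords


# Made-up reference trace standing in for the network's predicted distances.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0),
             (8.0, 4.0, 2.0), (9.0, 7.0, 3.5)]
target = distance_matrix(reference)

start = random_coords(len(reference))
folded = refine(start, target)
print(f"stress before: {stress(start, target):.3f}, after: {stress(folded, target):.6f}")
```

Because distances are unchanged by rotation, translation, and reflection, the recovered coordinates match the reference only up to those transformations, and plain gradient descent may settle in a local minimum; resolving such ambiguities is one reason additional information such as torsion angles is useful.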

Further, quantum mechanics is used to determine the properties of molecules at a subatomic level, which is used to estimate protein–ligand interactions during drug development. However, with conventional computational techniques, quantum mechanics can be computationally very expensive and demanding, which can limit its accuracy [77]. With AI, quantum mechanics can become more user-friendly and efficient. Schütt et al. 2019 have recently developed a DL-driven tool, referred to as SchNOrb (https://github.com/atomistic-machine-learning/SchNOrb), which can predict molecular orbitals and wave functions of organic molecules accurately. With these data, we can determine the electronic properties of molecules, the arrangement of chemical bonds around a molecule, and the location of reactive sites [78]. Thus, SchNOrb can help researchers in designing new pharmaceutical drugs. Moreover, molecular dynamics (MD) simulation analyzes how molecules behave and interact at an atomistic level [79]. In drug discovery, MD simulation is used to evaluate protein–ligand interactions and binding stability. One major issue with MD simulation is that it can be very arduous and time-consuming. AI has the capacity to accelerate the process of MD simulation [80]. In this regard, Drew Bennett et al. performed MD simulations to calculate free energies for transferring 15,000 small molecules from water to cyclohexane and then trained a 3D convolutional network and a spatial graph CNN using these free energies and some other atomistic features. The researchers found that the trained neural networks predicted free energies of transfer with accuracy similar to that of the MD simulation calculations [81]. This study shows that ML techniques can improve and expedite MD simulations. However, a large amount of training data is required to achieve this.

Moreover, de novo drug design has also taken advantage of AI in recent years. For example, Q. Bai et al. 2020 have devised MolAICal (https://molaical.github.io/), a tool that can design three-dimensional drugs in three-dimensional protein pockets [82]. MolAICal designs 3D drugs through two components: (i) the first component uses DL and a genetic algorithm trained on US Food and Drug Administration (FDA)-approved drugs for de novo drug design; (ii) the second component combines molecular docking and a DL model trained on the ZINC database (https://zinc.docking.org/) [83]. Likewise, Popova et al. 2018 designed a deep reinforcement learning-based algorithm, referred to as ReLeaSE (https://github.com/isayev/ReLeaSE), for de novo drug design. ReLeaSE achieves its desired outcome by integrating two deep neural networks (DNN), known as generative and predictive, where the generative model is used to produce new compounds, and the predictive model is used to predict the properties of the compounds [84]. Further, in recent times, AI has been used to upgrade synthesis planning as well, a process that is used to determine an optimal synthesis pathway for a molecule of interest. Recently, Grzybowski et al. [85] developed a DT-based program, referred to as Chematica, to design novel synthesis pathways for desired molecules. Similarly, Genheden et al. have implemented AiZynthFinder (https://github.com/MolecularAI/aizynthfinder), an open-source tool for retrosynthesis planning built on Monte Carlo tree search guided by a neural network [86]. Likewise, Segler et al. [87] used three distinct neural networks in conjunction with Monte Carlo tree search to discover novel retrosynthesis routes. ICSYNTH (https://www.deepmatter.io/products/icsynth/) is another tool that can produce novel chemical synthesis pathways by using a collection of chemical rules generated via ML models [88].

Additionally, various text mining-based tools have also been developed, which can aid the process of traditional drug discovery. Text mining uses methods like natural language processing (NLP) to transform unstructured text in the literature and in databases into structured data, which can be analyzed appropriately to gain new insights. NLP is a branch of AI that allows computers to process and analyze human language, such as speech and text, through AI-based algorithms. For instance, Jang et al. 2018 developed PISTON (http://databio.gachon.ac.kr/tools/PISTON/), a tool that can predict drug side effects and drug indications using NLP and topic modeling [89]. Likewise, DisGeNET (https://www.disgenet.org/) is a text mining-driven database that contains a plethora of information on gene-disease and variant-disease relationships [90]. Data in DisGeNET can be used to analyze various biological processes, such as adverse drug reactions, molecular pathways involved in disease, and drug action on targets. Further, STRING (https://string-db.org/) is another text mining-driven database containing a myriad of information on protein–protein interactions for various organisms [91]. In addition, STITCH (http://stitch.embl.de/) is another text mining-driven database, which contains information on interactions between proteins and chemicals/small molecules [92]. Information in STITCH can also be used to ascertain binding affinities of drugs and drug-target associations.
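The step of turning unstructured text into structured data can be illustrated with a deliberately simple sketch. It is not how PISTON, DisGeNET, STRING, or STITCH work internally; it only counts sentence-level co-occurrences between entries of two small, hand-written vocabularies. The vocabularies, the example sentences, and the helper name `extract_pairs` are all assumptions made for illustration; real systems use trained named-entity recognition and relation extraction models.

```python
import re
from collections import Counter

# Small illustrative vocabularies (assumptions of this sketch).
GENES = {"TP53", "BRCA1", "APP"}
DISEASES = {"breast cancer", "alzheimer's disease"}


def extract_pairs(text):
    """Count (gene, disease) pairs that are mentioned in the same sentence."""
    pairs = Counter()
    # Naive sentence splitting on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        genes = {g for g in GENES if re.search(rf"\b{re.escape(g)}\b", sentence)}
        diseases = {d for d in DISEASES if d in lowered}
        for gene in genes:
            for disease in diseases:
                pairs[(gene, disease)] += 1
    return pairs


# Made-up sentences standing in for abstracts retrieved from a literature database.
abstracts = (
    "Mutations in BRCA1 are frequently reported in breast cancer. "
    "Altered processing of APP is linked to Alzheimer's disease. "
    "BRCA1 testing is common in breast cancer clinics. "
    "TP53 was not discussed in this cohort."
)

pairs = extract_pairs(abstracts)
for (gene, disease), count in pairs.most_common():
    print(f"{gene}\t{disease}\t{count}")
```

The output is a small table of gene, disease, and evidence count, which is the kind of structured record that can then be stored in a database and queried.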

Artificial intelligence in primary and secondary drug screening

Today, AI has emerged as a highly successful and sought-after technology because it saves time and is cost-efficient [93]. In general, cell classification, cell sorting, calculating the properties of small molecules, synthesizing organic compounds with the help of computer programs, designing new compounds, developing assays, and predicting the 3D structure of target molecules are time-consuming and tiresome tasks; AI can reduce this burden and speed up the drug discovery process [94, 95]. Primary drug screening includes the classification and sorting of cells by AI-based image analysis. Many ML models using different algorithms recognize images with great accuracy but perform poorly when analyzing big data. To classify a target cell, the ML model first needs to be trained to identify the cell and its features, which is basically done by contrasting the targeted cells in the image so that they are separated from the background [96]. Texture features, such as wavelet-based and Tamura texture features, are then extracted from the images and reduced in dimensionality through principal component analysis (PCA). One study suggests that the least-square SVM (LS-SVM) showed the highest classification accuracy, at 95.34% [97, 98]. For cell sorting, the machine needs to be fast enough to separate the targeted cell type from the given sample. Evidence suggests that image-activated cell sorting (IACS) is the most advanced device, as it can measure the optical, electrical, and mechanical properties of the cell [99] [Fig. 3].

Fig. 3
Artificial intelligence in primary and secondary drug screening: in the drug discovery and design pipeline, the screening of potential leads is crucial, and artificial intelligence plays a great role in identifying novel and potential lead compounds. There are approximately 106 million chemical structures present in chemical space, drawn from different studies such as OMIC studies, clinical and pre-clinical studies, in vivo assays, and microarray analysis. With machine learning models such as reinforcement models, logistic models, regression models, and generative models, these chemical structures are screened based on active sites, structure, and target binding ability. The complete drug discovery process through artificial intelligence will take about 14–18 years, which is comparatively less than the traditional drug discovery process. The first step in the drug discovery process is lead identification, in which a disease-modifying target protein is identified through reverse docking, bioinformatics analysis, and computational chemical biology. In the second step, primary screening of compounds is done to select potential lead compounds that can inhibit the target protein. This can be done through virtual screening and de novo design. The next step includes lead optimization and lead compound identification through focused library design, drug-like analysis, drug-target reproducibility, and computational biology. Afterward, secondary screening of compounds is performed, followed by pre-clinical trials. The final step of the drug discovery process is clinical development through cell-culture analysis, animal model experimentation, and patient analysis.

The secondary drug screening includes analyzing the physical properties, bioactivity, and toxicity of the compound. Melting point and partition coefficient are among the physical properties that govern a compound's bioavailability and are also essential for designing new compounds [100]. When designing a drug, molecules can be represented using different methods such as molecular fingerprinting, the simplified molecular-input line-entry system (SMILES), and Coulomb matrices [101]. These representations can be used in a DNN that comprises two stages, namely the generative and the predictive stage. Both stages are first trained separately through supervised learning; when they are then trained jointly, a bias can be applied to the output, which is either rewarded or penalized for a specific property. This whole procedure can be used for reinforcement learning [84]. Matched molecular pair (MMP) analysis has been extensively used for QSAR studies. An MMP is associated with a single change in a drug candidate, which in turn influences the bioactivity of the compound [102]. Along with MMP, other ML methods such as DNN, RF, and gradient boosting machines (GBM) are used to evaluate such modifications. It has been observed that DNN can predict better than RF and GBM [103]. With the growth of publicly available databases such as ChEMBL, PubChem, and ZINC, we have access to millions of compounds annotated with information such as their structure, known targets, and purchasability; MMP combined with ML can predict bioactivity properties such as oral exposure, intrinsic clearance, ADMET, and mode of action [98, 104, 105]. Optimizing the toxicity of a compound is the most time-consuming and expensive task in drug discovery and is a crucial parameter, as it adds significant value to the drug development process.
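
Of the molecular representations mentioned above, the Coulomb matrix is simple enough to sketch directly. In its commonly used form, the diagonal entries are 0.5·Z_i^2.4 and the off-diagonal entries are Z_i·Z_j/|R_i − R_j|, where Z denotes nuclear charges and R atomic positions in atomic units. The function name and the hydrogen molecule geometry below are illustrative assumptions and are not taken from the cited studies.

```python
import math


def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix of a molecule as a list of lists.

    charges: nuclear charges Z_i, one per atom.
    coords:  (x, y, z) positions in bohr, one tuple per atom.
    Diagonal entries are 0.5 * Z_i**2.4; off-diagonal entries are
    Z_i * Z_j / |R_i - R_j|.
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


if __name__ == "__main__":
    # Illustrative H2 geometry: two protons 1.4 bohr apart (assumed value).
    h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
    for row in h2:
        print(row)
```

The matrix is symmetric and depends only on interatomic distances, so it does not change when the molecule is translated or rotated; it does depend on atom ordering, which is why sorted or eigenvalue-based variants are often used as model input.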

Applications of artificial intelligence in drug development process

The most arduous and discouraging step in the drug discovery and development process is identifying suitable and bioactive drug molecules within the vast chemical space, which is on the order of 10^60 molecules. Further, the drug discovery and development process is considered time-consuming and costly. Most frustratingly, nine out of ten drug molecules usually fail to pass phase II clinical trials and other regulatory approvals [106,107,108]. These limitations of drug discovery and development can be addressed by implementing AI-based tools and techniques. AI is involved in every stage of the drug development process, such as small molecule design, identification of drug dosage and associated effectiveness, prediction of bioactive agents, protein–protein interactions, identification of protein folding and misfolding, structure- and ligand-based VS, QSAR modeling, drug repurposing, prediction of toxicity and bioactive properties, and identification of the mode of action of drug compounds, as discussed below.

Peptide synthesis and small molecule design

Peptides are biologically active short chains of around 2–50 amino acids, which are increasingly being explored for therapeutic purposes because they can cross the cellular barrier and reach the desired target site [109]. In recent years, researchers have taken advantage of AI and used it to discover novel peptides. For instance, Yan et al. 2020 developed Deep-AmPEP30, a DL-based platform for the identification of short anti-microbial peptides (AMPs) [110]. Deep-AmPEP30 (https://cbbio.online/AxPEP/) is a CNN-driven tool that predicts short AMPs from DNA sequence data. Using Deep-AmPEP30, Yan et al. identified novel AMPs from the genome sequence of C. glabrata, a fungal pathogen present in the GI tract. Likewise, Plisson et al. 2020 combined an ML algorithm with an outlier detection technique to discover AMPs with non-hemolytic profiles [111]. In addition, Kavousi et al. developed IAMPE (http://cbb1.ut.ac.ir/), a web server for the identification of anti-microbial peptides, which integrates 13C NMR-based features and physicochemical features of peptides as input to ML algorithms in order to identify novel AMPs [112]. Similarly, Yi et al. 2019 devised ACP-DL (https://github.com/haichengyi/ACP-DL), a DL-based tool for the discovery of novel anti-cancer peptides [113]. ACP-DL uses the LSTM algorithm, which is an improved version of the recurrent neural network (RNN), for differentiating anti-cancer peptides from non-anti-cancer peptides. Moreover, Yu et al. [114] proposed DeepACP, a deep recurrent neural network-based model for identifying anti-cancer peptides. Likewise, Tyagi et al. 2013 developed an SVM-based platform for identifying new anti-cancer peptides [115]. In addition, Rao et al. 2020 combined a graph convolutional network and one-hot encoding to design ACP-GCN for the discovery of anti-cancer peptides [116]. Moreover, Grisoni et al. used an ensemble of four counter-propagation ANNs for identifying new anti-cancer peptides. Likewise, Wu et al. [117] proposed PTPD, a tool based on CNN and word2vec, for the discovery of novel peptides for therapeutics.

Moreover, small molecules, which have a very low molecular weight, are also being explored for therapeutic purposes using AI-based tools, as peptides are. For instance, Zhavoronkov et al. [118] devised generative tensorial reinforcement learning (GENTRL), a generative reinforcement learning-based tool for the de novo design of small molecules. With the help of GENTRL (https://github.com/insilicomedicine/GENTRL), Zhavoronkov et al. discovered novel inhibitors of the enzyme DDR1 kinase [118]. Likewise, McCloskey et al. [119] combined DNA-encoded small molecule library (DEL) data with ML models such as Graph CNN and RF to discover novel small drug-like molecules. Similarly, Xing et al. [120] integrated XGBoost, SVM, and DNN to find small molecules for targets implicated in rheumatoid arthritis.

Identification of drug dosage and drug delivery effectiveness

Administering an improper dose of any drug to a patient can lead to undesirable and even lethal side effects; hence, it is crucial to determine a safe drug dose for treatment purposes. Over the years, it has been challenging to ascertain the optimum dose of a drug that can achieve the desired efficacy with minimum toxic side effects [121]. With the emergence of AI, many researchers are turning to ML and DL algorithms to determine appropriate drug dosage. For instance, Shen et al. [122] developed an AI-based platform, referred to as AI-PRS, to determine the optimum doses and combinations of drugs for HIV treatment through antiretroviral therapy. AI-PRS is a neural network-driven approach that relates drug combinations and dosage to efficacy through a parabolic response surface (PRS). In their study, Shen et al. administered a combination of tenofovir, efavirenz, and lamivudine to 10 HIV patients and, using the PRS method, found that the dose of tenofovir could be reduced by 33% of the starting dose without causing virus relapse. Hence, AI-PRS could also be used to find optimum drug dosages for other diseases. Further, Pantuck et al. [123] developed CURATE.AI to determine an adequate drug dose; it uses a patient’s personal data and transforms it into a CURATE.AI profile in order to ascertain the optimum dose. In their study, a combination of the cancer drug enzalutamide and the investigational drug ZEN-3694 was given to a patient with metastatic castration-resistant prostate cancer. Using CURATE.AI, they found over time that a dose of ZEN-3694 50% lower than the starting dose could achieve the desired results and arrest cancer growth.
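
The parabolic (second-order) response model that PRS-type approaches use to relate doses to efficacy can be sketched as a quadratic function of two doses followed by a search for the best dose pair. This is a minimal sketch and not the AI-PRS or CURATE.AI implementation: the coefficients are hypothetical (in practice they would be fitted to patient or assay data), the dose units are arbitrary, and a plain grid search stands in for whatever optimizer a real platform uses.

```python
def prs_response(d1, d2, coef):
    """Second-order response surface for two drug doses.

    coef = (b0, b1, b2, b11, b22, b12) gives
    b0 + b1*d1 + b2*d2 + b11*d1**2 + b22*d2**2 + b12*d1*d2.
    """
    b0, b1, b2, b11, b22, b12 = coef
    return b0 + b1 * d1 + b2 * d2 + b11 * d1 ** 2 + b22 * d2 ** 2 + b12 * d1 * d2


def best_dose_pair(coef, max_d1, max_d2, steps=50):
    """Grid-search the dose pair with the highest predicted response.

    Returns (response, d1, d2) for the best grid point in
    [0, max_d1] x [0, max_d2].
    """
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1):
            d1 = max_d1 * i / steps
            d2 = max_d2 * j / steps
            response = prs_response(d1, d2, coef)
            if best is None or response > best[0]:
                best = (response, d1, d2)
    return best


if __name__ == "__main__":
    # Hypothetical coefficients: efficacy rises with dose, then falls again.
    coef = (0.0, 4.0, 2.0, -2.0, -1.0, 0.0)
    response, d1, d2 = best_dose_pair(coef, max_d1=2.0, max_d2=2.0)
    print(f"best predicted response {response:.2f} at d1={d1:.2f}, d2={d2:.2f}")
```

Because the surface is quadratic, only six coefficients are needed for two drugs, which is part of why such models can be calibrated from a small number of measurements.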

Further, Julkunen et al. [124] devised comboFM (https://github.com/aalto-ics-kepaco/comboFM), a novel ML-driven tool that ascertains appropriate drug combinations and doses in pre-clinical studies such as those on cancer cell lines. comboFM determines appropriate drug combinations and doses by using factorization machines (https://github.com/geffy/tffm), an ML framework for high-dimensional data analysis. In their study, using comboFM, Julkunen et al. identified a novel combination of the anti-cancer drugs crizotinib and bortezomib, which showed promising efficacy in lymphoma cell lines. Similarly, Sharabiani et al. used an ML approach to determine the optimum initial dose of the anticoagulant drug warfarin. They used relevance vector machines to classify patients based on their dose demands, and then regression models were used to predict appropriate doses for the patients [125]. Likewise, Nemati et al. [126] developed a deep reinforcement learning model trained on the Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC II) database to find an ideal dose of another anticoagulant drug, heparin. Likewise, Tang et al. [127] used ML techniques such as ANN, Bayesian additive regression trees, boosted regression trees, and multivariate adaptive regression splines to determine the optimum dose of the immunosuppressive drug tacrolimus. Moreover, Hu et al. [128] performed ML analysis with techniques such as classification and regression trees, multilayer perceptron networks, and k-nearest neighbor to find a safe initial dose of the cardiac drug digoxin. In addition, Imai et al. [129] developed a DT model to find a safe starting dose of the antibiotic drug vancomycin.

Predicting bioactive agents and monitoring of drug release

Designing and monitoring drug-likeness is a tedious and time-consuming process. Lately, multiple online tools have been developed to analyze drug release and to check the suitability of selected bioactive compounds as carriers. Benchmark data sets are later used to validate the computational analysis. For such evaluations, pharmacophore models based on chemical features are best suited. These models construct large 3D data sets developed via in silico experiments or in-house compound collections [130]. To study ligand-based chemical features, various successful experiments have been carried out using the CATALYST program (www.accelrys.com), and a group of researchers successfully predicted 11β-hydroxysteroid dehydrogenase type 1 inhibitors using VS experiments [131].

Determining bioactive ligands is a crucial step in selecting a potent drug for a specific target. Researchers are now taking advantage of artificial intelligence to determine bioactive compounds that can be used against specific targets associated with a disease. For instance, Wu et al. integrated DL and RF methods to devise WDL-RF (https://zhanglab.ccmb.med.umich.edu/WDL-RF/) for determining the bioactivity of ligands targeting G protein-coupled receptors (GPCRs). Likewise, Cichonska et al. [132] developed pairwiseMKL (https://github.com/aalto-ics-kepaco), a multiple kernel learning-based method, for determining the bioactivity of compounds [133]. To test their model's efficiency, they used it to predict the anti-cancer potency of compounds. Further, Mustapha et al. [134] developed an XGBoost model to determine bioactive chemical molecules. In addition, Merget et al. [135] created machine learning models such as DNN and RF to determine the bioactivity of compounds against more than 280 different kinases. Furthermore, Arshadi et al. [136] have devised DeepMalaria, a DL-based model for identifying compounds with Plasmodium falciparum inhibitory activity. Likewise, Sugaya et al. [137] created a ligand-efficiency-driven support vector regression model to ascertain the biological activity of various chemical compounds. Moreover, Afolabi et al. [138] used data from the MDL Drug Data Report (MDDR) repository and applied it to a combination of boosting algorithms to identify novel bioactive compounds. Additionally, Petinrin et al. [139] used the majority voting technique with an ensemble of different machine learning models to determine biologically active molecules.

Further, adverse drug reactions (ADRs) are unexpected, harmful, and sometimes fatal side effects caused by drug administration. ADRs are a major challenge in drug development, and it has become essential to identify possible ADRs during the early stages of drug development to make the process more robust and efficacious. Lately, researchers have used AI to determine possible ADRs associated with different drugs before they are launched on the market for public use. For instance, Dey et al. [140] used a DL-based model that can predict the ADRs associated with a drug and even identify the chemical substructures responsible for those ADRs. In addition, Liu et al. [141] integrated the chemical, biological, and phenotypic properties of drugs to predict the ADRs associated with them via machine learning analysis. Likewise, Jamal et al. [142] combined biological, chemical, and phenotypic properties to predict nervous system ADRs linked with drugs through machine learning analysis. The authors also used their model to find ADRs associated with current Alzheimer's drugs. Further, Xue et al. [143] integrated biomedical network topology with a DL algorithm to predict drug-ADR correlations. Moreover, Raja et al. [144] used machine learning analysis to predict ADRs that result from drug-drug interactions. They further used their model to predict ADRs related to cutaneous disease drugs. Besides screening for an effective bioactive agent, another critical area is drug-likeness and drug interaction post-release. Recently, SwissADME (http://www.swissadme.ch), a freely accessible, user-friendly graphical interface, was developed to evaluate the compatibility of a drug and its pharmacokinetic actions [145].
Mathematical models such as Higuchi, Hixson–Crowell, Ritger–Peppas–Korsmeyer, Brazel–Peppas, Baker–Lonsdale, Hopfenberg, Weibull, and Peppas–Sahlin have also been applied in drug discovery, and one of the most common practices has been the calculation of the drug loading capacity of the selected or screened bioactive molecule.
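
Two of the models named above have compact closed forms that are often used to describe release kinetics: the Higuchi model, Q = k_H·√t, and the Korsmeyer–Peppas power law, M_t/M_∞ = k·t^n. The sketch below evaluates both and fits them by ordinary least squares. The function names and the synthetic data are assumptions for illustration only; they are not values from the cited studies.

```python
import math


def higuchi(t, k_h):
    """Cumulative release under the Higuchi model, Q = k_H * sqrt(t)."""
    return k_h * math.sqrt(t)


def korsmeyer_peppas(t, k, n):
    """Fraction released under the Korsmeyer-Peppas power law, k * t**n."""
    return k * t ** n


def fit_higuchi(times, released):
    """Least-squares k_H for a line through the origin in sqrt(t)."""
    numerator = sum(q * math.sqrt(t) for t, q in zip(times, released))
    denominator = sum(times)  # sum of sqrt(t)**2
    return numerator / denominator


def fit_korsmeyer_peppas(times, fractions):
    """Fit log f = log k + n log t; requires t > 0 and f > 0. Returns (k, n)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(f) for f in fractions]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    n = sxy / sxx
    k = math.exp(mean_y - n * mean_x)
    return k, n


if __name__ == "__main__":
    # Synthetic, noise-free data generated from assumed parameters.
    times = [1.0, 4.0, 9.0, 16.0]
    released = [higuchi(t, 5.0) for t in times]
    fractions = [korsmeyer_peppas(t, 0.1, 0.5) for t in times]
    print("k_H:", fit_higuchi(times, released))
    print("k, n:", fit_korsmeyer_peppas(times, fractions))
```

In the Korsmeyer–Peppas model, the fitted exponent n is commonly read as an indicator of the release mechanism, so comparing fits across models is a typical first step before choosing one.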

Prediction of protein folding and protein–protein interactions

Analyzing protein–protein interactions (PPIs) is crucial for effective drug development and discovery. Most protein annotation methods use sequence homology, which has limited scope. High-throughput PPI data, with ever-increasing volume, are becoming the foundation for new biological discoveries. A great challenge for bioinformatics is to manage, analyze, and model these data. Hence, computational models were developed that predict multiple inputs in one place simultaneously [146]. Computational methods are applied to study both PPIs and protein–protein non-interactions (PPNIs), although PPIs are considered more informative than PPNIs. PPI predictions can be categorized as direct PPIs, direct PPIs with indirect functional associations, and PPIs for signal transduction pathways [147]. Machine and statistical learning approaches such as K-nearest neighbor, Naïve Bayesian, SVM, ANN, DT, and RF are used to predict hindrance in PPIs. Bayesian networks (BN) have been applied to predict PPIs, essentially using gene co-expression, gene ontology (GO), and other biological process similarities. Data set integration using BN produces precise and accurate PPI networks that illustrate a comprehensive yeast interactome [148]. Another group also used BN to combine data sets for yeast to study PPIs [149]. A novel hierarchical model, the PCA-ensemble extreme learning machine (PCA-EELM), which predicts PPIs using only protein sequence information, has emerged as a powerful tool that gives accurate output in less time [150]. Further, the PPI prediction efficiency of DNNs was improved by a novel method known as DNN for protein–protein interactions prediction (DeepPPI) (http://ailab.ahu.edu.cn:8087/DeepPPI/index.html) [151]. In mammalian cells, signal transduction is mostly controlled by PPIs between unstructured motifs and globular protein-binding domains (PBDs).
To predict these PBDs across multiple protein families, a bespoke ML tool known as hierarchical statistical mechanical modeling (HSMM) was developed [152]. PPI_SVM, a novel tool for predicting PPIs based on ML, domain-domain affinities, and frequency tables, was developed in 2011 and is freely accessible at (http://code.google.com/p/cmater-bioinfo/) [153]. Owing to the increased number of solved complex structures, a multimeric threading approach, MULTIPROSPECTOR, has been developed. In this method, proteins with known template structures are rethreaded, and their interaction with other proteins, their interfacial energy, and their Z-score are established [154]. Struct2Net (http://struct2net.csail.mit.edu), a structure-based threading tool that uses logistic regression to evaluate the probability of interaction, is the first structure-based PPI predictor apart from homology modeling [155]. Gene cluster-based methods calculate the co-occurrence probability of orthologs of query proteins encoded by the same gene clusters. This method is also named domain/gene co-occurrence. If the genes of two proteins are not close to each other in the genome, this method cannot reliably predict an interaction between them [156, 157].

Structure-based and ligand-based virtual screening

In drug designing and drug discovery, VS is one of the crucial methods of CADD. VS refers to the identification of small chemical compounds that bind to a drug target. VS is an efficient method for screening out promising therapeutic compounds from a pool of compounds [158]. Thus, it has become an important tool in high-throughput screening, which suffers from high cost and a low accuracy rate. In general, there are two important types of VS: structure-based VS (SBVS) and ligand-based VS (LBVS) [159, 160]. LBVS depends on the chemical structure and empirical data of both active and inactive ligands and uses the chemical and physiochemical similarities of active ligands to predict other active ligands with high bioactivity from a pool of compounds. However, LBVS does not depend on the 3-D structure of the target protein; thus, this method is implemented where the target structure or information is missing or the accuracy of the obtained structure is low [161]. On the other hand, SBVS is implemented in cases where the 3-D structural information of the protein or target has been elucidated either through in vitro or in vivo experiments or through computational modeling [162, 163]. In general, this method is used to predict the interaction between the active ligand and its associated target and to predict the amino acid residues involved in drug-target binding. In comparison with LBVS, SBVS possesses high accuracy and precision. However, SBVS is challenged by the increasing number of disease-causing proteins and their complicated conformations [164]. To use ML for VS, there should be a filtered training set comprising known active and inactive compounds. These training data are used to train a model using supervised learning techniques. The trained model is then validated, and if it is accurate enough, the model is used on new data sets to screen compounds with the desired activity against a target [165].
After that, the shortlisted compounds can go for ADMET analysis, followed by various bioassays before entering clinical trials. Hence, ML has the power to speed up VS, make it more robust, and even reduce false positives in VS. Docking is the main principle applied in SBVS, for which several AI- and ML-based scoring algorithms have been developed, such as NNScore, CScore, SVR-Score, and ID-Score [166]. Similarly, ML and DL methods such as RFs, SVMs, CNNs, and shallow neural networks have been constructed to predict protein–ligand affinity in SBVS. Moreover, AI-based algorithms have been developed for molecular dynamics simulation assays in SBVS [167]. On the other hand, LBVS consists of several steps, and novel AI- and ML-based algorithms have been proposed for each step to speed up the process and increase reliability. For example, several ML- and DL-based algorithms, such as Gaussian mixture models (GMMs), isolation forests, and artificial neural networks (ANNs), have been constructed for the preparation of useful decoy sets.
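
The ML workflow for VS described above (a labeled training set of actives and inactives, a supervised model, validation, then screening of new compounds) can be sketched with a nearest-neighbor classifier on Tanimoto similarity. This is a toy illustration under stated assumptions: the fingerprints are hypothetical sets of on-bit indices rather than real molecular fingerprints, the compound names are invented, and a k-nearest-neighbor vote stands in for the more capable models cited in the text.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_active(query, training, k=3):
    """Majority vote among the k training compounds most similar to the query."""
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return sum(votes) * 2 > len(votes)


def leave_one_out_accuracy(training, k=3):
    """Validate by predicting each training compound from all the others."""
    hits = 0
    for i, (fingerprint, label) in enumerate(training):
        rest = training[:i] + training[i + 1:]
        hits += predict_active(fingerprint, rest, k) == label
    return hits / len(training)


def screen(library, training, k=3):
    """Return the names of library compounds predicted to be active."""
    return [name for name, fp in library.items() if predict_active(fp, training, k)]


# Hypothetical training set: (fingerprint, is_active).
TRAINING = [
    ({1, 2, 3, 4}, True), ({1, 2, 3, 5}, True),
    ({2, 3, 4, 6}, True), ({1, 3, 4, 7}, True),
    ({10, 11, 12, 13}, False), ({10, 11, 12, 14}, False),
    ({11, 12, 13, 15}, False), ({10, 12, 13, 16}, False),
]

# Hypothetical unlabeled library to be screened.
LIBRARY = {
    "cmpd_A": {1, 2, 4, 8},
    "cmpd_B": {10, 11, 13, 17},
    "cmpd_C": {2, 3, 5, 9},
}

if __name__ == "__main__":
    print("leave-one-out accuracy:", leave_one_out_accuracy(TRAINING))
    print("predicted actives:", screen(LIBRARY, TRAINING))
```

Validation comes before screening on purpose: the model is applied to the unlabeled library only if its accuracy on held-out training compounds is judged sufficient.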

Further, ML models such as PARASHIFT, HEX, USR, and ShaPE algorithms have been constructed for LBVS [168, 169]. Currently, with the rise of AI algorithms in the healthcare and pharma industry, different tools and models have been developed for both LBVS and SBVS. For example, tools such as MTiOpenScreen (http://bioserv.rpbs.univ-paris-diderot.fr/services/MTiOpenScreen/) [170], FlexX‐Scan [171], CompScore (http://bioquimio.udla.edu.ec/compscore/) [172], PlayMolecule BindScope (PlayMolecule.org) [173], GeauxDock (http://www.brylinski.org/geauxdock) [174], EasyVS (http://biosig.unimelb.edu.au/easyvs) [175], DEKOIS 2.0 [176], PL-PatchSurfer2 (http://www.kiharalab.org/plps2/) [177], SPOT-ligand 2 (http://sparks-lab.org/) [178], Gypsum-DL (https://durrantlab.pitt.edu/gypsum-dl/) [179], and ENRI [180] have been developed for SBVS. Moreover, mounting evidence validates the hypothesis that AI plays a critical role in SBVS, such as identification of non-peptide cysteine-cysteine chemokine receptor 5 receptor agonists [181], screening of partial agonists of the β2 adrenergic receptor [182], identification of bromodomain-containing protein 4 inhibitors [183], discovery of natural product-like signal transducer and activator of transcription 3 dimerization inhibitor [184], prediction of VHL and hypoxia-inducible factor 1-alpha inhibitors [185], and prediction of Kelch-like ECH-associated protein-nuclear factor erythroid 2-related factor 2 (Keap-Nrf2) small-molecule inhibitors [186]. Likewise, Liu et al. 2017 discovered low toxicity O-GlcNAc transferase inhibitors, whereas Dou et al. [187] identified novel glycogen synthase kinase 3 beta (GSK-3β) inhibitors through SBVS [188]. 
Different studies were conducted on cancer and leukemia through SBVS, such as the discovery of novel GSK-3β inhibitors for the treatment of acute myeloid leukemia [189], identification of a novel protein arginine methyltransferase 5 inhibitor in non-small cell lung cancer [190], identification of potent vascular endothelial growth factor receptor 2 compounds for the treatment of renal cell carcinoma [191], identification of multi-targeted inhibitors against breast cancer [192], and discovery of an Mdm2-p53 inhibitor [193]. Recently, the novel coronavirus became a huge problem worldwide, and here too SBVS provides a great opportunity for chemical and biological scientists to identify novel drug compounds against disease-causing targets. For example, Gahlawat et al. 2020 identified saquinavir, lithospermic acid, and 11m_32045235 as promising therapeutic compounds against the SARS-CoV-2 main protease, whereas Selvaraj et al. 2020 demonstrated that TCM 57025, TCM 3495, TCM 5376, TCM 20111, and TCM 31007 were therapeutic compounds that interact with the substrate-binding site of N7-MTase [194, 195]. In the same vein, Cruz et al. 2018 concluded that ZINC91881108 was a potent compound against RIPK2, whereas Simoben et al. 2018 demonstrated eight novel N-(2,5-dioxopyrrolidin-3-yl)-n-alkylhydroxamate derivatives as smHDAC8 inhibitors with IC50 values ranging from 4.4 to 20.3 µM against smHDAC8 [196, 197] [Fig. 4].

Fig. 4
a Ligand-based virtual screening: in the drug design and discovery process, ligand-based virtual screening is the most crucial step, which comprises different steps as shown in the figure. The initial step consists of database screening and the 3-D structural model's prediction through the active site for a special target and X-ray structure of complexes. Later on, pharmacophore modeling of selected compounds with selected features is performed, followed by pharmacophore and docking-based virtual screening of compounds. The screened compounds are subjected to different toxicity and physiochemical properties for further analysis. Finally, the lead compounds are subjected to in vitro and in vivo bioassays for validation. b structure-based virtual screening: it is another type of virtual screening applied in the drug discovery process, where target structure preparation and chemical compound library preparation are initial steps. Afterward, structural analysis and binding site prediction are done, followed by molecular docking of compounds with the selected target. Later on, molecular dynamics simulation studies are carried out to validate the screened compounds in silico, followed by experimental validation through bioassays

Moreover, different algorithms and tools have been developed for LBVS such as SwissSimilarity (http://www.swisssimilarity.ch/) [198], METADOCK [199], an open-source platform [200], HybridSim-VS (http://www.rcidm.org/HybridSim-VS/) [201], PKRank [202], PyGOLD (http://www.agkoch.de/) [203], BRUSELAS (http://bio-hpc.eu/software/Bruselas) [204], RADER (http://rcidm.org/rader/) [205], QEX [206], IVS2vec (https://github.com/haiping1010/IVS2Vec) [207], AutoDock Bias (http://autodockbias.wordpress.com/) [208], Ligity [209], D3Similarity (https://www.d3pharma.com/D3Targets-2019-nCoV/D3Similarity/index.php) [210], and GCAC (http://ccbb.jnu.ac.in/gcac) [211]. Emerging evidence suggests the potential implementation of AI algorithms in LBVS, such as the identification of aurora kinase A inhibitors [212], G-quadruplex-targeting chemotypes [213], PI3Kα inhibitors [214], compounds targeting dengue virus non-structural protein 3 helicases [215], potential selective histone deacetylase 8 inhibitors [216], and novel p-hydroxyphenylpyruvate dioxygenase inhibitors [217]. Apart from these studies, a number of reports have validated the possible implementation of AI in LBVS, such as the identification of HIV entry inhibitors and potent inhibitors of DNA methyltransferase [218, 219]. Like SBVS, LBVS also plays a crucial role in identifying potential therapeutic compounds against novel human coronaviruses. For example, Amin et al. 2020 demonstrated a molecular docking study of some in-house molecules as papain-like protease inhibitors, whereas Hofmarcher et al. 2020 used a DNN to identify 30,000 compounds from a library of 3.6 M compounds as CoV-2 inhibitors [220, 221]. Similarly, Choudhary et al. 2020 identified SARS-CoV-2 cell entry inhibitors, whereas Ferraz et al. 2020 identified bedaquiline, glibenclamide, and miconazole as potential therapeutic compounds against coronavirus [222, 223]. Xiao et al. 2018 developed ligand-based big data DNN models for VS of compound libraries against six anti-cancer targets. The study integrated 0.5 M chemical compounds, and the models developed were evaluated by tenfold cross-validation [224]. With the growing size of chemical compound libraries, it has become so difficult to find a potential hit that it is like finding a “needle in a haystack.” Thus, SBVS and LBVS have a huge role in minimizing the complexity of identifying potential therapeutic compounds against a disease-causing target. Further, AI-based models make SBVS and LBVS simpler while offering high accuracy and precision. Table 1 discusses the different AI- and DL-based web tools and algorithms implemented in LBVS and SBVS.

Table 1 Application of artificial intelligence (AI) algorithms including machine learning (ML) and deep learning principles in structure and ligand-based virtual screening

QSAR modeling and drug repurposing

In drug designing and discovery, it is crucial to establish the relationship between chemical structures, their physiochemical properties, and their biological activities. QSAR modeling is a computational approach through which quantitative mathematical models can be created between chemical structure and biological activity. The main advantage of developing a mathematical model is identifying diverse chemical structures from molecular databases that can be used as therapeutic compounds against a disease target. Once the most promising compound is selected, it is subjected to laboratory synthesis and in vitro or in vivo testing. QSAR models are broadly classified into two types: regression models and classification models. Gaussian processes (GPs) are a type of regression model for QSAR building and provide a robust and powerful method of QSAR modeling. GP methods can handle a large number of descriptors and identify the crucial ones. Recently, two classification models have been demonstrated using GP: one is an intrinsic GP classification method, and the other is a combination of the GP regression technique and probit analysis [235, 236]. Further, the method is suitable for modeling nonlinear relationships and does not require subjective determination of the model parameters [237]. Recent advancements and increasing applications of ML algorithms such as neural networks, DL, and SVM provide a great avenue for QSAR modeling.
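
The regression flavor of QSAR described above can be illustrated in its simplest form as a straight-line fit between one molecular descriptor and a measured activity. This is a minimal sketch that uses ordinary least squares rather than the Gaussian process or neural network models discussed in the text; the descriptor (a logP-like value), the activity values, and the function names are hypothetical.

```python
def fit_linear_qsar(descriptor, activity):
    """Ordinary least squares for activity = slope * descriptor + intercept."""
    n = len(descriptor)
    mean_x = sum(descriptor) / n
    mean_y = sum(activity) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, activity))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(descriptor, activity, slope, intercept):
    """Coefficient of determination of the fitted line on the given data."""
    mean_y = sum(activity) / len(activity)
    ss_res = sum(
        (y - (slope * x + intercept)) ** 2 for x, y in zip(descriptor, activity)
    )
    ss_tot = sum((y - mean_y) ** 2 for y in activity)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical data: a logP-like descriptor and pIC50-like activities.
    log_p = [1.2, 1.9, 2.5, 3.1, 3.8]
    p_ic50 = [5.1, 5.6, 6.2, 6.5, 7.1]
    slope, intercept = fit_linear_qsar(log_p, p_ic50)
    print("slope:", slope, "intercept:", intercept)
    print("R^2:", r_squared(log_p, p_ic50, slope, intercept))
    # Predict the activity of an untested compound from its descriptor alone.
    print("predicted activity at 2.8:", slope * 2.8 + intercept)
```

Real QSAR models use many descriptors and must be validated on compounds that were not used for fitting; the R² computed here on the training data only shows how well the line describes the data it was fitted to.
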
Several web-based tools and algorithms have been developed for QSAR modeling, such as VEGA platform (https://www.vega-qsar.eu/) [238], QSAR-Co (https://sites.google.com/view/qsar-co) [239], FL-QSAR (https://github.com/bm2-lab/FL-QSAR) [240], Meta-QSAR (https://github.com/meta-QSAR/simple-tree) (https://github.com/meta-QSAR/drug-target-descriptors) [241], DPubChem (www.cbrc.kaust.edu.sa/dpubchem) [242], Transformer-CNN (https://github.com/bigchem/transformer-cnn) [243], Cloud 3D-QSAR (http://chemyang.ccnu.edu.cn/ccb/server/cloud3dQSAR/) [244], MoDeSuS, and Chemception (https://github.com/Abdulk084/Chemception) [245]. Karpov et al. 2020 developed a novel ANN-based algorithm for QSAR modeling called Transformer-CNN. The method uses SMILES augmentation for training and inference. Similarly, Wang et al. 2020 developed a web-based QSAR modeling tool that integrates molecular structure generation, alignment, and molecular interaction field features. Jin et al., through Cloud 3D-QSAR, discovered a potent and selective monoamine oxidase B (MAO-B) inhibitor. In this study, the authors concluded that (S)-1-(4-((3-fluorobenzyl)oxy)benzyl)azetidine-2-carboxamide (C3) was a more potent and selective inhibitor of MAO-B than safinamide. Further, in vivo analysis revealed that compound C3 could inhibit cerebral MAO-B activity and rescue 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP)-induced dopaminergic neuronal loss [246]. In the same vein, Bennett et al. 2020, through Chemception, predicted the transfer free energy of small molecules by combining MD simulations and DL [81]. Moreover, the QSAR-Co tool has been implemented in different studies, such as the development of multi-target chemometric models for the inhibition of class I phosphoinositide 3-kinase enzyme isoforms, screening of ERK inhibitors as anti-cancer agents, prediction of functional inhibitors of K562 cells, and prediction of the antifungal properties of phenolic compounds [247,248,249,250]. 
Likewise, Kim and Cho 2018 developed a novel algorithm called PyQSAR (https://github.com/crong-k/pyqsar_tutorial), a fast QSAR modeling platform that uses ML and Jupyter notebook. PyQSAR is a standalone Python package that combines all QSAR modeling processes in a single workbench [251]. A. S. Geoffrey et al. 2020 conducted two different studies using PyQSAR: identification of potent drug candidates for the novel coronavirus, and QSAR modeling of quercetin and its tumor necrosis factor-alpha inhibition activity [252, 253]. Further, Zuvela et al. developed ANN-based QSAR models for prediction of the antioxidant activity of flavonoids. In this study, the authors integrated six methods (PaD, PaD2, weights, stepwise, perturbation, and profile) for interpretation and elucidation of ANN-based models, which calculate Trolox-equivalent antioxidant properties. The results suggested that the ANN-based algorithm could eliminate the difficulties that arise from poor interpretation of the quantum mechanical parameters describing the molecular structure [254]. In parallel, Ding et al. 2020 generated a web-based tool known as VISAR (https://github.com/Svvord/visar) for dissecting chemical features through the DNN QSAR approach [255]. Mounting evidence demonstrates the implementation of QSAR modeling in the drug design and discovery process, such as modeling of ToxCast assays relevant to the molecular initiating events of AOPs in hepatic steatosis [256], development of dipeptidyl peptidase 4 inhibitors against dipeptidyl peptidase 8 and dipeptidyl peptidase 9 enzymes [257], the applicability of QSAR model on domain analysis of HIV-1 protease inhibitors [258], and targeting HIV/HCV coinfection [259]. A well-recognized problem for ML models is the imputation of missing values in the bioassay data used for SAR model generation. 
There are three major types of missing values: (i) Missing Completely at Random (MCAR), which occurs when the probability of a missing value in a variable is the same for all samples; (ii) Missing at Random (MAR), which means that the probability of a missing value in a variable depends only on the available information in other predictors; and (iii) Missing Not at Random (MNAR), which means that the probability of a missing value is not random and depends on information that is not recorded, so it cannot be explained by the observed data alone [260]. There are several ways to handle missing values, such as imputation with zero, the mean, the median, or the mode (most common value), imputation with a randomly selected value, imputation with a model, or imputation with the deep learning library Datawig. Most real-world data sets contain missing values, which need to be handled carefully in order to build a robust model [261]. In addition, the complexity of the data should be reduced, and the data must be curated to increase the accuracy and precision of the models generated. Initially, QSAR models were implemented for predicting the toxicity and metabolism of small molecules, that is, molecules with a molecular weight (MW) of less than 1500. However, the QSAR technology applied in the early 2000s had limitations in accuracy and reliability [262]. With the growing application of QSAR in the drug discovery and design process, such as VS, lead optimization, and target identification, medicinal scientists and biologists have made constant efforts to develop more reliable and dependable approaches [263]. AI/ML algorithm-based QSAR models have the potential to eliminate the constraints imposed by early methods. AI/ML-based QSAR models, namely hologram-based QSAR (HQSAR), group-based QSAR (G-QSAR), and ensemble-based QSAR, have accelerated the drug discovery process severalfold [264, 265]. 
Further, beyond the classical Hansch and Free-Wilson approaches, QSAR has gradually evolved over the past few years with newer refinement approaches, new methods for descriptor calculation, methodical validation tests, and the inclusion of receptor structural information. Similarly, beyond classical lead optimization, QSAR has been applied in different emerging areas of drug discovery and design such as peptide QSAR, mixture toxicity QSAR, nanoparticle QSAR, QSAR of ionic liquids, cosmetic QSAR, phytochemical QSAR, and material informatics [266] (Fig. 5).
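
The simple imputation strategies listed above (zero, mean, median, mode, and a randomly selected observed value) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited study: the function name `impute`, the use of `None` to mark a missing measurement, and the example activity values are all hypothetical. Model-based and Datawig-based imputation are not covered.

```python
import random
import statistics


def impute(values, strategy="mean", seed=0):
    """Return a copy of `values` with None entries filled in.

    Hypothetical helper: `None` marks a missing bioassay measurement.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    rng = random.Random(seed)  # seeded so the 'random' strategy is reproducible
    fillers = {
        "zero": lambda: 0.0,
        "mean": lambda: statistics.mean(observed),
        "median": lambda: statistics.median(observed),
        "mode": lambda: statistics.mode(observed),
        "random": lambda: rng.choice(observed),  # drawn anew for each gap
    }
    if strategy not in fillers:
        raise ValueError(f"unknown strategy: {strategy}")
    fill = fillers[strategy]
    return [fill() if v is None else v for v in values]


if __name__ == "__main__":
    # Illustrative activity values only; not taken from any data set.
    activity = [6.1, None, 7.4, 5.9, None, 7.4]
    for name in ("zero", "mean", "median", "mode", "random"):
        print(name, impute(activity, strategy=name))
```

Which strategy is appropriate depends on the missingness mechanism: simple fills of this kind are easiest to justify under MCAR, and can bias a model when values are MAR or MNAR.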

Fig. 5

a Quantitative structure–activity relationship workflow: the initial step is data set compilation, where data from public and literature databases are collected and compiled, then divided into different subsets for investigation. Afterward, data set processing is performed, comprising data pre-processing and curation followed by calculation of molecular descriptors. After descriptor calculation, the data are normalized and split into different sets. In the third step, model construction is performed, where internal and external data sets are assembled and learning algorithms are applied for QSAR modeling. Statistical calculations are then carried out to measure model robustness. The final step is model evaluation, where the model is evaluated by comparison with previous benchmark models, identification of characteristic features, performance evaluation, and interpretation of essential features. b Drug repurposing or repositioning workflow: the first step is collection of data and data pre-processing followed by computational model generation. The models generated are support vector machines, logistic regression, random forest, deep learning, and matrix factorization. Afterward, the generation of proof-of-concept from a literature source is performed. Later on, evaluation of repositioning models through cross-validation, case analysis, and evaluation metrics is performed. Finally, validation of repurposed drugs is carried out through clinical trials, in vitro studies, and in vivo studies

Apart from QSAR modeling, AI algorithms have also been implemented in drug repurposing, also called drug repositioning. In drug design and discovery, drug repositioning refers to investigating drugs that have already been developed for one disease and repositioning them for other diseases. Repositioning may succeed because the same targets can be involved in multiple diseases [267,268,269]. In addition, the emergence of large data sets from genomics, proteomics, and pharmacological in vivo and in vitro studies provides a great avenue for drug repositioning. Recently, the emergence of AI-based tools and algorithms in drug discovery has provided a platform for future research. ML algorithms replace conventional methods based on chemical similarity and molecular docking with new systems biology methods, which can evaluate drug effects [270,271,272,273]. Thus, different AI-based algorithms and web-based tools have been developed in recent times, such as DrugNet (http://genome2.ugr.es/drugnet/) [274], DRIMC (https://github.com/linwang1982/DRIMC) [275], DPDR-CPI (http://cpi.bio-x.cn/dpdr/) [276], PHARMGKB (https://www.pharmgkb.org/) [277], PROMISCUOUS 2.0 (http://bioinformatics.charite.de/promiscuous2) [278], and DRRS (http://bioinformatics.csu.edu.cn/resources/softs/DrugRepositioning/DRRS/index.html) [279]. Moreover, Yella and Jegga 2020 constructed a model for drug repositioning using a multi-view graph attention approach known as MGATRx [280], whereas Yan et al. 2019 constructed a novel algorithm for drug repurposing based on a multisimilarity fusion approach known as BiRWDDA [281]. Further, Fahimian et al. 2020 constructed a novel algorithm known as RepCOOL to identify promising repurposed drugs for stage II breast cancer. The results indicated that doxorubicin, paclitaxel, trastuzumab, and tamoxifen were potential therapeutic agents against stage II breast cancer [282]. Likewise, Li et al. 
2020 constructed a computational framework of host-based drug repurposing for broad-spectrum antivirals against RNA viruses. In this study, the authors investigated 2352 approved drugs and 1062 natural compounds against different viral pathogens and concluded that the repurposed drugs were effective against Zika virus and coronavirus [283]. Further, Wu et al. 2020 applied two ML models, a structural profile prediction model and a biological profile prediction model, to predict anti-fibrosis drug candidates. The areas under the receiver operating characteristic curve were 0.879 and 0.972 in the training set and 0.814 and 0.874 in the testing set. The results indicated that natural products possess anti-fibrosis characteristics and can serve as potential anti-fibrosis drug candidates [284]. Recently, COVID-19 emerged as a global pandemic, and researchers around the globe started the hunt for promising therapeutic agents. In this regard, AI-based drug repositioning plays a crucial role. For example, network-based drug repurposing identified 16 potential anti-HCoV repurposable drugs, whereas Hooshmand et al. 2020 identified 12 promising drug targets for COVID-19 based on a multimodal DL approach [285, 286]. In recent times, the development of neural networks, DL models, and pipelines for drug repositioning has increased to a great extent. For example, SNF-CVAE, based on drug similarity network fusion, identified promising therapeutic agents for Alzheimer’s disease (AD) and juvenile rheumatoid arthritis, whereas DTI-RCNN, a neural network algorithm that integrates long short-term memory, predicts drug-target interactions [287, 288]. PhenoPredict and SDTNBI are two other ML-based algorithms, used for disease phenome-wide drug repositioning for schizophrenia and for prediction of drug-target interactions, respectively [289, 290]. Zang et al. 
2019 developed a DL-based model known as deepDR (https://github.com/ChengF-Lab/deepDR) for in silico drug repositioning. In the study, the authors integrated 10 different types of biological networks: drug-disease, drug-side effect, drug-target, and seven drug-drug networks. The results showed that deepDR predicted approved drugs such as risperidone and aripiprazole for the treatment of Alzheimer's disease (AD), and methylphenidate and pergolide for the treatment of Parkinson's disease (PD) [291]. Likewise, Chen et al. 2020 constructed a novel AI-based algorithm called iDrug (https://github.com/Case-esaC/iDrug) for the integration of drug repositioning and drug-target prediction through cross-network embedding. The efficiency and effectiveness of iDrug allow users to gain novel clinical insights into drug-target-disease mechanisms [292]. Studies have demonstrated that drug repurposing through AI-based algorithms can be implemented in cancer. For example, Li et al. 2020 integrated transcriptomics data and chemical structure information using DL and identified pimozide as a promising therapeutic candidate against non-small cell lung cancer [293]. Similarly, Kuenzi et al. 2020 predicted drug response and synergy using a DL model of human cancer cells. The results indicated that the predicted combinations improve progression-free survival and that response predictions stratify the clinical outcomes of ER-positive breast cancer patients [294]. Another AI application in drug repurposing comes from the study performed by Wang et al. 2020, which used bipartite graph convolutional networks for in silico drug repurposing. The authors constructed a model known as BiFusion (https://github.com/zcwang0702/BiFusion) through DL and heterogeneous information fusion. The results demonstrated that BiFusion achieved better performance than multiple baselines for drug repurposing [295]. 
The examples mentioned above illustrate the potential role of AI-based algorithms in drug repurposing. Further, as technology advances, chemical, biological, and computational scientists continue to search for methods to improve the accuracy and precision of AI-based models. Moreover, both QSAR and drug repositioning are incomplete without molecular docking, which is used to analyze the interaction between a target molecule and a ligand molecule. Initially, in the early 2000s, molecular docking was developed as a standalone tool for determining the interaction between these two molecules. However, with the advent of AI technology, the applicability of molecular docking has changed. Molecular docking is now used in conjunction with MD simulation and AI-based tools in different areas of drug discovery such as VS, target identification, polypharmacology, and drug repurposing [296]. The implementation of MD simulation and AI-based algorithms can increase the efficiency and accuracy of molecular docking. In addition, over the years, limitations in the use of molecular docking have also been addressed. For instance, in drug design, molecular docking can be used only for biological targets whose crystal structures are available, and for many targets no structure is available. Techniques such as homology modeling have been developed to overcome this hindrance [297]. Further, crystal structure data in the PDB are increasing exponentially, enhancing the applicability of molecular docking in drug discovery. Table 2 summarizes the tools and algorithms that have been implemented in in silico QSAR and drug repositioning.

Table 2 Application of artificial intelligence (AI) algorithms including machine learning (ML) and deep learning principles in drug design and discovery process

Prediction of physicochemical properties and bioactivity

It is well established that every chemical compound has physicochemical properties, such as solubility, partition coefficient, degree of ionization, and permeability coefficient, which may hinder the pharmacokinetic properties of the compound and its drug-target binding efficiency. Thus, the physicochemical properties of compounds must be considered while designing a novel drug molecule [100, 298]. For this purpose, different AI-based tools have been developed to predict the physicochemical properties of chemical compounds. These tools rely on molecular representations such as molecular fingerprints, SMILES strings, Coulomb matrices, and potential energy measurements, which are used in the DNN training phase [299, 300]. Recently, Zhang et al. developed a QSAR model to predict six different physicochemical properties of environmental agents obtained from the Environmental Protection Agency (EPA). Similarly, Lusci et al. 2013 constructed a neural network-based model to predict molecular properties. In the study, molecules are described by undirected cyclic graphs, whereas earlier approaches for predicting physicochemical properties use directed acyclic graphs [301]. Later on, six AI-based algorithms were constructed for the prediction of human intestinal absorption of compounds: SVM, k-nearest neighbor, probabilistic neural network, ANN, PLS, and linear discriminant analysis. Among these models, SVM achieved the highest accuracy, 91.54% [302]. In 2016, Zang et al. developed an ML-based model for the prediction of physicochemical properties such as octanol–water partition coefficient, water solubility, boiling point, melting point, vapor pressure, and bioconcentration factor of environmental chemicals [303]. 
Moreover, different AI-based tools have been developed, such as ALOGPS 2.1 (http://www.vcclab.org/lab/alogps/) [304], ASNN (http://www.vcclab.org/lab/asnn/) [305], E-BABEL (http://www.vcclab.org/lab/babel/) [304], PCLIENT (http://www.vcclab.org/lab/pclient/) [304], E-DRAGON (http://www.vcclab.org/lab/edragon/) [304], ChemSpider (http://www.chemspider.com/) [306], SPARC (http://sparc.chem.uga.edu/sparc/) [307], and OSIRIS property explorer (https://www.organic-chemistry.org/prog/peo/) [308]. In 2020, a study was conducted on the design, synthesis, and ADMET prediction of bis-benzimidazoles as anticancer agents. In the same study, the authors calculated the molecular properties of the compounds using Lipinski’s rule of five and predicted the pre-ADMET properties of the synthetic compounds [309]. Further, Puratchikody et al. 2016 used OSIRIS property explorer to predict the quantitative structural toxicity of tyrosine derivatives intended for safe, potent inflammation treatment. The results indicated that out of 55 potent molecules, only 19 were considered potent cyclooxygenase-2 inhibitors [310]. Along similar lines, RF- and DNN-based models were constructed to predict human intestinal absorption of different chemical compounds. From these examples, it can be concluded that AI-based approaches play a significant role in drug discovery and development through the prediction of physicochemical properties.
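
Lipinski’s rule of five, mentioned above, can be expressed as a short check on four precomputed descriptors. The sketch below is illustrative only and is not the workflow of the cited study: the class and function names are hypothetical, the descriptor values must come from another tool, and the tolerance of one violation follows the commonly used reading of the rule.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient (log P)
    h_bond_donors: int      # number of hydrogen-bond donors
    h_bond_acceptors: int   # number of hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumption: a compound passes with at most one violation."""
    return lipinski_violations(d) <= max_violations


if __name__ == "__main__":
    # Approximate, illustrative descriptor values for a small drug-like molecule.
    small = Descriptors(mol_weight=180.2, logp=1.2,
                        h_bond_donors=1, h_bond_acceptors=4)
    print(lipinski_violations(small), passes_rule_of_five(small))
```

Such a filter is typically applied before more expensive ADMET predictions, since it needs only descriptors that are cheap to compute.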

Moreover, the therapeutic activity of a drug molecule depends on its binding efficiency with the receptor or target, so a chemical molecule that shows no binding affinity for the drug target will not be considered as a therapeutic agent. For this reason, predicting the binding affinity of a chemical molecule for its therapeutic target is vital for drug discovery and development [311]. Recent advancements in AI algorithms enhance the process of binding affinity prediction, which uses similarity features of the drug and its associated target. Several web-based tools have been developed, such as ChemMapper and the similarity ensemble approach (SEA). Further, ML- and DL-based models for the identification of drug-target affinity have been constructed, such as KronRLS, SimBoost, DeepDTA, and PADME [312]. KronRLS is an ML algorithm that uses the similarity between drugs and between targets to calculate the drug-target binding affinity. It considers both feature-based and similarity-based interactions while predicting drug-target binding affinity [313]. DL approaches such as DeepDTA (https://github.com/hkmztrk/DeepDTA) [314] and PADME [315] predict drug-target binding affinity, which depends on the 3-D structure of a protein. Beck et al. 2020 conducted a study using DeepDTA to predict which commercially available antiviral drugs are potential therapeutic agents against the novel coronavirus (SARS-CoV-2) [316]. Similarly, Lee and Kim 2019 predicted drug-target interactions by DNN based on large-scale drug-induced transcriptome data using PADME [317]. Another DL model, which uses both RNN and CNN, was constructed to predict drug-target binding affinity and is called DeepAffinity (https://github.com/Shen-Lab/DeepAffinity) [318]. Jiang et al. 2019, using DeepAffinity, proposed a novel protein descriptor for identifying drug-target interactions, whereas Born et al. 
2020, with the help of DeepAffinity, identified antiviral candidates for SARS-CoV-2 [319, 320]. These studies support the importance of ML and DL algorithms for predicting the physicochemical properties and bioactivity of drug molecules during drug design. However, the validation and accuracy of such algorithms remain a significant drawback from a research perspective. Thus, extensive research should be done to maximize the accuracy and precision of AI-based algorithms through curated and extensive data input. In Table 2, we have summarized the tools and databases for physicochemical and bioactivity prediction based on AI algorithms, including DL, neural networks, SVM, and others.
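
As a concrete illustration of the "similarity features" that similarity-based affinity predictors build on, the sketch below computes the Tanimoto coefficient between two binary fingerprints. The choice of Tanimoto similarity is an assumption made here for illustration; the text above does not state which measure each tool uses. The fingerprints, compound names, and function names are likewise hypothetical, with each fingerprint represented as the set of its "on" bit positions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def most_similar(query_fp, library):
    """Return (name, similarity) of the library entry closest to the query.

    `library` maps a compound name to its fingerprint (hypothetical data).
    """
    return max(
        ((name, tanimoto(query_fp, fp)) for name, fp in library.items()),
        key=lambda pair: pair[1],
    )


if __name__ == "__main__":
    # Toy fingerprints; real ones would come from a cheminformatics toolkit.
    library = {
        "ligand_A": {1, 2, 3, 8},
        "ligand_B": {2, 3, 4},
        "ligand_C": {10, 11},
    }
    print(most_similar({2, 3, 4, 5}, library))
```

A similarity-based predictor would use such pairwise scores, together with an analogous similarity between target proteins, as input features for the affinity model.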

Prediction of mode of action and toxicity of compounds

Drug toxicity refers to the adverse effect of a chemical molecule on an organism, or on any part of the organism, due to the compound's mode of action or metabolism. The potential of AI to predict the off-target and on-target effects of drug molecules, along with the in vivo safety of chemical compounds before their synthesis, has fascinated scientists involved in the drug development process. The involvement of AI has reduced drug development time, cost, attrition rates, and human resources. For this purpose, different web-based tools have been developed, such as LimTox (http://limtox.bioinfo.cnio.es/) [321], pkCSM (http://biosig.unimelb.edu.au/pkcsm/) [322], admetSAR (http://lmmd.ecust.edu.cn/admetsar2/) [323], and Toxtree (http://toxtree.sourceforge.net/) [324]. Srivastava et al. 2020 used admetSAR to evaluate the toxicity of Withania somnifera as a therapeutic compound against COVID-19, whereas Uygun et al. 2021 used pkCSM to identify the therapeutic effect and toxicological properties of a pyrazolo[1,5-a]pyrazine-4(5H)-one derivative on a lung adenocarcinoma cell line [325, 326]. Advancements in AI-based approaches have led to the development of different toxicity prediction software and web-based tools, such as Tox21 (https://ntp.niehs.nih.gov/whatwestudy/tox21/index.html) [327], SEA (http://sea.bkslab.org/) [328], eToxPred (https://www.brylinski.org/etoxpred-0) [329], and TargeTox (https://github.com/artem-lysenko/TargeTox) [330]. Tox21 evaluates the toxicity of 12,707 environmental compounds and drugs, whereas SEA forecasts the toxicity of 656 marketed drugs against 73 unintended targets. TargeTox predicts toxicity risk based on the target-drug biological network. In 2016, Huang et al. predicted the in vivo toxicity profile and characterized the mechanisms of more than 10,000 chemical compounds by modeling Tox21 data, whereas, in the same year, Zhou et al. 
predicted cancer-relevant proteins using an improved molecular SEA [331, 332]. Further, Gupta and Rana 2019 employed eToxPred to predict the toxicity of small molecules of the androgen receptor. The authors used 1444 characteristic features of small molecules for 10,273 drugs, of which 461 were considered active and 9812 inactive [333].

DeepTox (http://bioinf.jku.at/research/DeepTox/tox21.html) [334] and PrOCTOR (https://github.com/kgayvert/PrOCTOR) [335] are used for predicting the toxicity of new compounds and the probability of toxicity in clinical trials, respectively. For example, Robledo-Cadena et al. 2020 predicted the effect of non-steroidal anti-inflammatory drugs on cisplatin, paclitaxel, and doxorubicin efficacy against cervix cancer cells using PrOCTOR, whereas Gilvary et al. 2020 identified novel indications for 2576 small molecules, incorporating 16 different drug features, for PD and type 2 diabetes [336, 337]. Similarly, using DeepTox, Simm et al. 2018 analyzed and repurposed high-throughput imaging assay data to predict the biological activity of different chemical compounds targeting alternative biological pathways and processes [338]. Furthermore, DeepTox was used for the development of several ML and DL algorithms that predict the toxicity properties and characteristic chemical features of drug compounds, such as SMILES2Vec (predicts chemical properties) [339], Chemception (DNN-based prediction of chemical properties) [245], DeepSynergy (prediction of anti-cancer drug synergy with DL) [340], and deepAOT (prediction of compound acute oral toxicity) [341]. However, the accuracy and precision of DeepTox and PrOCTOR could be increased by using large and refined data sets, which could be achieved with the pharmaceutical industry's involvement. Recently, other ML-based tools such as SPIDER [342] and read-across structure–activity relationships (RASAR) [343] were developed, which are capable of analyzing β-lapachone targets and of linking the molecular structures and toxic properties of an unknown compound, respectively.

Zhang et al. developed different predictive models for drug-induced liver toxicity based on five ML algorithms combined with MACCS or FP4 fingerprints. The best model yielded an accuracy of 75% on an external validation data set [344]. Similarly, several toxicity evaluation algorithms were constructed based on ML methods such as relevance vector machine (RVM), regularized-RF, C5.0 trees, eXtreme gradient boosting (XGBoost), AdaBoost, SVM boosting (SVMBoost), and RVM boosting (RVMBoost). The constructed models were used to evaluate rat oral acute toxicity, respiratory toxicity, and urinary tract toxicity [345,346,347,348]. In recent years, deep-learning algorithms have led to novel approaches for the molecular representation of chemical compounds, making DL methods suitable for predicting compound toxicity. Further, the potential of DL algorithms for toxicity prediction depends on the quality and quantity of the data sets. In short, more research is needed to make AI-based algorithms reliable for toxicity prediction. Current ML-based predictors cannot yet replace biological systems, but they are sufficient to steer medicinal chemistry in the right direction, which reduces the number of synthesis cycles. A detailed description of AI-based toxicity prediction algorithms and tools is given in Table 2.

Identification of molecular pathways and polypharmacology

One of the significant outcomes of AI and ML algorithms in drug discovery and development is the prediction and estimation of the overall topology and dynamics of disease networks, drug-drug interactions, and drug-target relationships [349]. This methodology offers a vast avenue for the identification of novel molecular therapeutic targets for a particular disease. Text mining-driven databases such as DisGeNET, STITCH, and STRING are widely used to ascertain gene-disease associations, drug-target associations, and molecular pathways, respectively. For instance, Gu et al. 2020 used the similarity ensemble approach to identify targets for the 197 most commonly used Chinese herbs. The DisGeNET database was then used to associate those drug targets with different diseases, thus linking herbs with the diseases in which they can be used [350]. Further, Chen et al. 2019 used the STITCH database to find targets of potential drugs shortlisted for esophageal carcinoma [351]. Likewise, Taha et al. 2020 used the STITCH database to find targets for the active constituents of Nandina domestica, a plant used for treating various tumors. The STRING database was then used to construct compound-target pathways with the help of the Cytoscape tool [352].

In medicinal chemistry, polypharmacology refers to designing a single drug molecule capable of interacting with multiple targets in a disease-related drug-target biological network. It is best suited for designing promising therapeutic agents for more complex diseases such as cancer, neurodegenerative diseases (NDDs), diabetes, heart failure, and many others [353,354,355]. ML-based methods have the potential to analyze guilt-by-association molecular networks owing to their strong data mining and data analysis capabilities. Further, ML models assist in the rational design of multi-target ligands by generating chemical compounds with desired polypharmacological features: because ML models generate a vast number of chemical structures with different chemical and topological features, the probability of discovering multi-target ligands increases. Furthermore, ML models help in the identification of multi-target ligands where the binding pockets are dissimilar. Recent advancements in AI in drug discovery and development have led to the generation of web-based tools and stand-alone software packages for polypharmacology prediction, such as polypharmacology browser (PPB) (http://www.gdb.unibe.ch/) [356], TarPred (http://www.dddc.ac.cn/tarpred/) [140], Self-Organizing Map Based Prediction of Drug Equivalence Relationship (SPiDER) (http://modlabcadd.ethz.ch/software/spider) [357], TargetHunter (https://www.cbligand.org/TargetHunter3D/) [358], PharmMapper (http://lilab-ecust.cn/pharmmapper/) [359], ChemMapper (http://lilab.ecust.edu.cn/chemmapper/) [360], and Swiss Target Prediction (SwissTargetPrediction) (http://www.swisstargetprediction.ch/) [361]. Poirier et al. 2018 used PPB to identify lysophosphatidic acid acyltransferase β as the target of a nanomolar angiogenesis inhibitor, whereas Ozhathil et al. 
2018 identified potent and selective small-molecule inhibitors of the transient receptor potential cation channel subfamily M member 4 using PPB [362, 363]. Further, Van Vleet et al. 2018 implemented the TarPred tool in screening strategies and methods for improved off-target liability prediction, whereas, in the same year, Ratnawati et al. predicted active compounds from SMILES codes using a backpropagation algorithm [364, 365]. Among the above-mentioned web-based tools, PharmMapper and ChemMapper are frequently used in current research. For example, studies of the synergistic mechanism of huangqi and huanglian in diabetes mellitus [366], the blood-enriching mechanism of danggui buxue decoction [367], and the multiple mechanisms of Hedyotis diffusa Willd. in colorectal cancer [368] used PharmMapper. Similarly, identification of a human copper trafficking blocker in cancer [369], identification of multi-target ligands through chemical-protein interaction in AD [370], prediction of the anticancer mechanism of Kushen Injection against hepatocellular carcinoma [371], and discovery of pteridin-7(8H)-one-based therapeutic compounds against the epidermal growth factor receptor kinase T790M/L858R mutant [372] were performed using ChemMapper. One major limitation of AI algorithms for polypharmacology prediction is inadequate or unreliable data. Thus, quantum chemical calculations, which provide fine-tuned data sets, should be performed to increase the accuracy of predictive models.

Moreover, AI in drug development has opened the gates for identifying molecular pathways or molecular targets for the treatment of human disease through genomic information, biochemical features, and target specifications [373]. “OpenTargets” (https://www.opentargets.org/) [374] is a free, ML-based tool used for prioritizing potential therapeutic drug targets with over 71% accuracy. Recently, Nabirotchkin et al. identified the unfolded protein response and autophagy-related pathways of common approved drugs against COVID-19, whereas Lopez-Cortes et al. identified allele frequencies in colorectal cancer [375, 376]. Further, a GWAS conducted by Isac-Lopez et al. [377] predicted multiple risk loci and highlighted fibrotic and vasculopathy pathways. The results demonstrated that 27 independent genome-wide-associated signals and 13 novel risk loci were associated with systemic sclerosis. Martin et al. studied chromatin interactions to predict novel gene targets in rheumatic diseases. In the same study, the authors concluded that 454 high-confidence genes were associated with rheumatic disease, of which 48 were drug targets and 11 were existing targets. Finally, they demonstrated that 367 drugs were suitable for repositioning [378].

Implementation of artificial intelligence in de novo drug designing

The iterative process of using the 3D structure of a receptor to generate novel molecules is termed de novo drug design, and it is intended to produce new active compounds. However, de novo drug design has not seen widespread use in drug discovery, although the field has seen some revival recently because of advances in AI [421, 422]. VS has emerged as a major tool in the drug development process, as it enables efficient in silico searches across an enormous number of compounds and thereby increases the yield of potential drug leads. As a subset of AI, ML is a technique for guiding VS for drug leads, which generally involves compiling a filtered set of compounds, containing known actives and inactives, to train a model [423, 424]. After training, the model is tested and, if accurate enough, applied to a previously unseen database to identify novel drug candidates. In this section, we discuss how AI has proved to be a boon for de novo drug design.
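The train, test, and screen workflow described above can be illustrated with a minimal sketch. It assumes that compounds are represented as binary fingerprints (sets of on-bit indices) and uses a small Bernoulli naive Bayes classifier purely as a stand-in for whichever ML model a study might choose; the function names, the toy data, and the 0.8 accuracy gate are all hypothetical.

```python
import math

def train_bernoulli_nb(fingerprints, labels, n_bits):
    """Fit a Bernoulli naive Bayes model on binary fingerprints.

    fingerprints: list of sets holding the indices of the bits that are on.
    labels: list of 1 (known active) or 0 (known inactive).
    """
    model = {}
    for cls in (0, 1):
        members = [fp for fp, y in zip(fingerprints, labels) if y == cls]
        # Laplace smoothing keeps every probability strictly inside (0, 1).
        bit_prob = [
            (sum(1 for fp in members if bit in fp) + 1) / (len(members) + 2)
            for bit in range(n_bits)
        ]
        prior = (len(members) + 1) / (len(labels) + 2)
        model[cls] = (math.log(prior), bit_prob)
    return model

def predict(model, fp):
    """Return 1 if the fingerprint is more likely active than inactive."""
    scores = {}
    for cls, (log_prior, bit_prob) in model.items():
        score = log_prior
        for bit, p in enumerate(bit_prob):
            score += math.log(p if bit in fp else 1.0 - p)
        scores[cls] = score
    return 1 if scores[1] > scores[0] else 0

def accuracy(model, fingerprints, labels):
    """Fraction of held-out compounds that are classified correctly."""
    hits = sum(predict(model, fp) == y for fp, y in zip(fingerprints, labels))
    return hits / len(labels)

def screen(model, library, test_fps, test_labels, min_accuracy=0.8):
    """Screen an unlabeled library only if held-out accuracy is acceptable."""
    if accuracy(model, test_fps, test_labels) < min_accuracy:
        return []
    return [name for name, fp in library.items() if predict(model, fp) == 1]

# Toy data (hypothetical): bits 0-2 mark actives, bits 3-5 mark inactives.
N_BITS = 6
train_fps = [{0, 1}, {0, 1, 2}, {0, 2}, {3, 4}, {4, 5}, {3, 5}]
train_labels = [1, 1, 1, 0, 0, 0]
test_fps = [{0, 1, 5}, {3, 4, 5}]
test_labels = [1, 0]
library = {"cmpd_A": {0, 1, 2}, "cmpd_B": {3, 4}, "cmpd_C": {0, 1, 3}}

model = train_bernoulli_nb(train_fps, train_labels, N_BITS)
hits = screen(model, library, test_fps, test_labels)
print(hits)
```

In practice the fingerprints would come from a cheminformatics toolkit and the hold-out set would be far larger; the point of the sketch is only the gating step, in which the model is applied to the unknown library only after it passes the accuracy check.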

In one study, the researchers utilized the latent space representation of a variational autoencoder to train a model based on the quantitative estimate of drug-likeness (QED) score and the synthetic accessibility score (SAS) [425]. In another publication, the performance of such a variational autoencoder was compared with that of an adversarial autoencoder [426]. The adversarial autoencoder consists of a generative model that produces novel chemical structures. A second, discriminative adversarial model is trained to distinguish real molecules from generated ones, while the generative model tries to fool the discriminative one [427]. The adversarial autoencoder produced significantly more valid structures than the variational autoencoder in generation mode. In combination with an in silico model, novel structures predicted to be active against the dopamine receptor type 2 could be obtained. Researchers have also used a generative adversarial network (GAN) to propose compounds with putative anticancer properties [428].
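The interplay between the generative and the discriminative model can be written compactly as a minimax objective. The expression below is the standard GAN formulation, given here only as an illustration; the cited studies may use modified losses. Here G is the generator, D the discriminator, p_data the distribution of real molecules, and p_z the prior over the latent variable z.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximizes V by assigning high probability to real molecules and low probability to generated ones, whereas the generator minimizes V by producing structures that the discriminator cannot tell apart from real ones.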

RNNs have likewise been used effectively for de novo drug design. Since SMILES strings encode chemical structures as a sequence of characters, RNNs have been used to generate chemical structures. It was observed that RNNs have the potential to utilize SMILES strings for drug design [429]. A similar methodology was also used successfully to generate novel peptide structures [430]. Neural network learning was also applied to bias the generated compounds toward desired properties [431]. Similarly, transfer learning was used as another strategy to generate novel chemical structures with a desired biological activity. In the first step, the network is trained to learn the SMILES syntax with a large training set [432, 433]. In the second step, the training is continued with compounds having the desired activity. A few additional epochs of training were sufficient to bias the generation of novel compounds toward the chemical space occupied by active molecules. Five molecules were synthesized on the basis of such an approach, and the designed activity could be confirmed for four of them against nuclear hormone receptors [434]. Several different architectures have been proposed that generate valid, meaningful novel structures. The novel chemical space has been explored by these methods, with the property distribution of the generated molecules being similar to that of the extensive training set used. The first application of this approach was successful, with 4 out of 5 molecules showing the desired activity [435]. Combining AI with multi-objective optimization has been a promising solution to bridge the chemical and biological phases. Novel multi-objective approaches based on RNNs for automated de novo design from SMILES were developed to find the best possible match between physicochemical properties and their constrained biological targets.
The results indicated that AI and multi-objective optimization allow capturing the latent links joining chemical and biological aspects, thus providing easy-to-use options for customizable design strategies, which proved especially effective for both lead generation and lead optimization [436].
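To make the idea of treating SMILES strings as character sequences concrete, the sketch below shows one possible way to prepare RNN training data: each string is split into tokens and then turned into (current token, next token) pairs, which is the supervision a next-token model learns from. The regular expression, the start and end markers, and the function names are illustrative assumptions and are not taken from the cited studies.

```python
import re

# Bracket atoms, two-letter halogens, and two-digit ring closures are kept
# as single tokens; everything else is split character by character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)

def build_vocab(smiles_list):
    """Map every token to an integer id, reserving ids for start and end."""
    vocab = {"<s>": 0, "</s>": 1}
    for token in sorted({t for s in smiles_list for t in tokenize(s)}):
        vocab[token] = len(vocab)
    return vocab

def next_token_pairs(smiles, vocab):
    """Return (input id, target id) pairs for next-token prediction."""
    ids = [vocab["<s>"]] + [vocab[t] for t in tokenize(smiles)] + [vocab["</s>"]]
    return list(zip(ids[:-1], ids[1:]))

corpus = ["CCO", "ClCCBr"]
vocab = build_vocab(corpus)
print(tokenize("C[C@H](N)C(=O)O"))
print(next_token_pairs("CCO", vocab))
```

Under the transfer-learning scheme described above, the same pairs would first be built from a large general corpus and then from a small set of known actives; only the data change between the two training stages, not the tokenization.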

ML models such as SVM, RF, DNNs, and many others have been used in drug discovery for pharmaceutical applications ranging from docking to VS [437]. Recently, drug repurposing, which usually involves data mining and AI, has emerged as an innovative approach to shorten drug development [438]. One group proposed a question–answer artificial system (QAAI) capable of repurposing drugs, which used the Google semantic AI universal encoder to compute sentence embeddings in the red brain JSON database. The study validated the prediction that the lipoxygenase inhibitor drug zileuton is a modulator of the NRF2 pathway in vitro, with potential applications in reducing the macrophage M1 phenotype and reactive oxygen species production. This novel approach has proved effective for drug repositioning in NDDs [439]. With the rapid development of systems-based pharmacology and polypharmacology, method development for the rational design of multi-target drugs has become urgent. The first de novo multi-target drug design program, known as LigBuilder V3 (http://www.pkumdl.cn/ligbuilder3/), has been devised to design ligands for different receptors, multiple binding sites of one receptor, or different conformations of one receptor. LigBuilder V3 is also applicable to multi-target drug design and optimization, particularly for compact ligands for proteins with differing ligand-binding sites [440]. De novo drug design actively seeks to use sets of chemical rules for the fast and efficient identification of structurally new chemotypes with the desired set of biological properties. Moreover, fragment-based de novo design tools have been successfully applied in the discovery of non-covalent inhibitors. A new protocol, called Cov_FB3D, has also been devised, which involves the in silico assembly of potential novel covalent inhibitors by identifying the active fragments in the covalent binding site of the target protein [441].

Artificial intelligence: possible role in pharmaceutical manufacturing and clinical trial design

The use of computational methods is well established in the pharmaceutical industry. However, the introduction of AI has given a broader scope to develop new approaches that can improve and optimize drug discovery [442]. This has not only encouraged the scientific community but has also resulted in growing partnerships between the pharmaceutical industry and AI companies [443]. One study of 21,143 drugs reported an overall success rate of nearly 5.2% in 2013, down from 11.2% in 2005. Thus, the use of AI is mainly associated with a need to reduce attrition and costs [444]. It usually takes 12 years to bring a new drug to the market, which can cost up to 3 billion USD [445]. Further, it is a huge task to find a new drug when there are ~10^60 possible drug-like molecules [446]. The current drug discovery challenges are related to the toxicity of the drug, its side effects, choosing the right target site, appropriate dosages, and even intellectual property [447]. The pharmaceutical industry mostly does not share pharmacokinetic and pharmacodynamic measurements of drugs until they are approved. In addition, very little drug discovery data are available to train AI models [448]. A community is needed that can regulate and manage preclinical and clinical pharmacology data to accelerate the progress of AI in this field. Recent advances in AI have impacted clinical pharmacology in many ways, such as literature searching and processing, interactions with online predictive ML models, ML methods for framing healthcare policy in many countries, and predictive analysis of drug-related information [449, 450].

When a drug candidate successfully passes all preclinical tests, it is administered to patients in clinical trials, which comprise three phases: Phase 1, drug safety testing with a small number of people; Phase 2, drug efficacy testing with a small number of human subjects affected by a particular disease; and Phase 3, efficacy studies with a large number of patients. After the candidate passes the clinical trials, the FDA reviews it for approval and commercialization [451, 452]. Further, the failure rate of clinical trials adds to the inefficiency of the drug development process, and each failed trial wastes both the investment in the trial and the costs of preclinical testing. The two main reasons behind high failure rates are improper patient selection and inefficient monitoring during trials. Since the introduction of AI technology, the success rates of clinical trials have improved drastically [453]. A system for clinical trial matching has been developed by IBM Watson, which uses patients' medical records and an abundance of past clinical trial data to create detailed clinical findings profiles. It could also be used to monitor enrolled patients [454]. AI models can also reduce the cost of clinical trials by enhancing the success rate through analysis of toxicity, side effects, and other related parameters [455]. One such example, which predicted the outcome of phase I and phase II clinical trials, was based on DL and calculated the probability of possible side effects and a pathway activation score, which were further used to train the model [456]. Similarly, another project named Virtual Physiological Human was set up to support in silico trials [457]. Further development in AI technology will help in better management of clinical trial data, ultimately aiming to develop personalized medicines.

Involvement of artificial intelligence in drug development: a case of neurodegenerative diseases

NDDs are lethal, multifaceted, debilitating disorders of the central nervous system and a major cause of death worldwide. AD, PD, amyotrophic lateral sclerosis (ALS), and Huntington’s disease (HD) are some of the most commonly observed NDDs, which can ultimately lead to the death of neurons in different areas of the central nervous system [458]. The aggregation of toxic, misfolded, cytoplasmic proteins in different brain regions is one of the primary reasons for the inception of these disorders [459]. Further, these disorders can exhibit varying symptoms such as cognitive decline, slow movement, tremors, memory loss, depression, speaking problems, and muscle stiffness [460, 461]. The major challenge posed by NDDs lies in drug discovery, as to date no drug has been discovered that can arrest and reverse the progression of these disorders. Hence, there is a dire need for new drug targets and drug compounds that can alleviate the symptoms and mitigate the diseased conditions of the central nervous system [462]. Nowadays, ML is extensively used to find novel targets and biomarkers associated with NDDs. For example, Martínez-Ballesteros et al. 2016 combined DT, quantitative association rules, and hierarchical clustering to determine potential risk genes associated with AD via gene expression profiling of patient and control samples. Further, [463] used a combination of protein–protein interaction networks, an autoencoder, and SVM to predict novel target genes associated with PD. Likewise, [464] used ML models such as RF, DT, a generalized linear model, and rule induction to find risk genes of HD through gene expression profiling. Moreover, [465] used a CNN trained on an extensive GWAS data set to find novel risk single nucleotide polymorphisms and genes associated with ALS.

Moreover, ML techniques are also being used to find suitable inhibitors of target proteins implicated in NDDs. For instance, [466] applied a combination of VS, ML, and molecular docking to find class I and class IIb histone deacetylase (HDAC) inhibitors, as HDAC enzymes have been reported to promote AD neurotoxicity. Here, ML was used for the classification of inhibitors and non-inhibitors post-VS. Further, [467] used descriptors derived from MD simulation trajectories of the caspase-8 protein–ligand complex to train ANN and RF models to find inhibitors of caspase-8, a protease that has been implicated in AD pathogenesis. In another study, [468] used data from a traditional Chinese medicine database, followed by VS, molecular docking, and ML techniques, including DL, to find inhibitors of GSK3β, an enzyme implicated in AD. Further, MD simulation was used to assess the stability of GSK3β-ligand interactions. Additionally, Ponzoni et al. 2019 built a QSAR model for finding inhibitors of the BACE1 enzyme, which is responsible for β-amyloid (Aβ) aggregation in AD. Here, the QSAR model was built using an optimum set of molecular descriptors, which were selected using a combination of ML algorithms, hybridization techniques, a backward elimination strategy, and visual analysis [469]. Similarly, [470] used a cascade of Naïve Bayes networks to find potent and safe abelson tyrosine-protein kinase 1 (c-Abl) inhibitors, which promote neuroprotection in PD. Likewise, Shao et al. 2018 integrated an SVM algorithm with Tanimoto similarity-based clustering, followed by in vitro experiments, to find novel antagonists of both the A2A adenosine receptor and the dopamine D2 receptor, as it has been observed that blocking these two receptors leads to neuroprotection in PD [471]. In addition, [472] implemented molecular docking, AI-QSAR, and MD simulations to find inhibitors of the NLR family pyrin domain containing 3 (NLRP3), an inflammasome involved in PD pathogenesis.
Here, VS followed by docking was used to shortlist compounds from the traditional Chinese medicine database, whereas AI and QSAR models were used to ascertain the bioactivity of the compounds, followed by assessing their binding stability via MD simulations [472]. Similarly, [473] used molecular docking, AI, and MD simulations to discover inhibitors of Galectin-3, a protein implicated in neuroinflammation in HD. Here, molecular docking was used for initial shortlisting, followed by evaluating the bioactivity of compounds through ML and assessing their binding stability through MD simulations. Further, different studies have used ML algorithms for drug repurposing in NDDs. For instance, X. Zeng et al. 2019 developed a DL-based drug repurposing tool, called deepDR (https://github.com/ChengF-Lab/deepDR), which is used to find new repurposed drugs for AD and PD [291]. Furthermore, [474] proposed telmisartan as a potential repurposed drug for AD by using a genetic network-driven classification model. In addition, [475] proposed a drug repurposing strategy for PD by scanning scientific literature through an integration of knowledge representation learning and ML algorithms.
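Tanimoto similarity-based clustering, mentioned above in connection with Shao et al., can be sketched as follows. The example assumes fingerprints stored as sets of on-bit indices and uses a simple leader-style clustering with a hypothetical similarity threshold of 0.5; it illustrates the general technique and is not the specific procedure of the cited study.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def leader_cluster(fingerprints, threshold=0.5):
    """Group compounds around the first sufficiently similar cluster leader.

    Each compound joins the first cluster whose leader has a Tanimoto
    similarity of at least `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # list of (leader name, member names)
    for name, fp in fingerprints.items():
        for leader, members in clusters:
            if tanimoto(fp, fingerprints[leader]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((name, [name]))
    return [members for _, members in clusters]

# Toy fingerprints (hypothetical ligands).
fps = {
    "lig1": {1, 2, 3, 4},
    "lig2": {1, 2, 3, 5},
    "lig3": {7, 8, 9},
    "lig4": {7, 8, 9, 10},
    "lig5": {20, 21},
}
clusters = leader_cluster(fps)
print(clusters)
```

Picking one representative per cluster is a common way to keep a shortlist chemically diverse before in vitro testing; the result of leader clustering depends on the input order, so other clustering schemes may be preferable when that matters.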

Future challenges and possible solutions

At present, the major challenge for the pharmaceutical industry in developing a new drug is increased cost and reduced efficiency. However, ML approaches and recent developments in DL offer great opportunities to reduce this cost, increase efficiency, and save time during the drug discovery and development process. Advances in AI algorithms, especially in DL approaches, along with improving hardware architectures and easy accessibility of big data, all point toward the third wave of AI. AI approaches in drug development have aroused great interest among researchers, such that many pharmaceutical companies have collaborated with AI companies. Moreover, the number of startups in this field has also escalated and reached 230 by June 2020 [476]. Further, DL approaches integrate data at multiple levels through nonlinear models, which other AI and ML approaches fall short of doing. This multi-level integration makes DL algorithms advantageous, as it provides great accuracy and precision. Moreover, in comparison with other AI and ML algorithms, DL provides a much more flexible architecture to create a neural network for a specific problem [477,478,479,480]. AI applications such as natural language processing and image and voice recognition are now routine and have beaten humans in terms of performance [481]. So, it comes as no surprise that AI can very well be used in the drug discovery process. Today, AI is used in drug discovery for target identification, hit discovery, lead optimization, ADMET prediction, and structuring clinical trials. Despite great success, many challenges remain, such as high-quality data acquisition, under which there are two significant concerns. Firstly, labeling cannot be binary, as the action of drugs in biological systems is complicated; secondly, the amount of data available in drug discovery is very small compared to the enormous amount of information available.
Therefore, a community is required that provides not only quantity but also quality of data. In the pharmaceutical industry, open data sharing is not common, but the Pistoia Alliance has taken the initiative to start a movement that has encouraged many companies to share their data with others. They also intend to establish a uniform data format, which is technically challenging [161]. A possible solution to deal with this problem is to develop an algorithm that can handle sparse data; one such algorithm, named “one-shot learning,” has been developed by Stanford University and predicts properties of a drug on the basis of heterogeneous data [482]. Moreover, the accuracy and uncertainty of the experimental data can be used for model building; that is, instead of establishing new ML technologies, one can put effort into training existing ones by tuning a large number of hyperparameters and optimizing them for good results, although some studies indicated that some reasonable parameters can be used to start the optimization [435]. Molecular representation is also a challenge, as it is one of the governing factors in model building. A few recently developed models learn task-related features from the raw data and refine the molecular representation to a standard. Earlier, drug repurposing relied only on clinical observations. However, the large amount of data now available, comprising scientific literature, patents, and clinical trial results, can collectively be used to improve the screening process. Additionally, DL-based VS can make full use of the data and reduce the false-positive rates that arise from imbalance between positive and negative data. Lead optimization is also a challenge in developing an efficient drug with good ADMET properties and target activities, because these parameters are independent and at times mutually incompatible. This problem can be solved by optimizing each parameter separately and further improving the model.
Pharmaceutical companies face trouble recruiting a sufficient number of patients for clinical trials. AI approaches will help identify and recruit target patients and will also help in managing the collected data. Regarding drug discovery for neurodegenerative disorders, the major problem is their unknown pathophysiology, which makes drug identification even more challenging. The “black box” nature of ML models is an additional challenge, as even experts cannot explain how the model arrives at a result or comprehend the biological mechanism behind it. Furthermore, the escalating number of ML models, each claiming to be the latest, has left non-professionals helpless, as they cannot decide which model to choose to solve their problem. Thus, it would be better if users and developers agreed upon a standard objective evaluation and then checked the performance of the model against it. Further, it is important to note that most countries do not grant patents for inventions that are created exclusively by AI technology. Moreover, companies that use AI technology for drug discovery have to go through a rigorous process to copyright their work so as to secure patent rights. Security is also a major concern, as AI-driven personalized medicine requires a person’s genetic code, for which personal information will be required. Finally, faster computation will be required for handling big data, and it is said that in the future current supercomputers will be replaced by quantum computers or another technology that will do the job in minutes rather than hours. Although AI has yielded many novel targets and novel compounds for different diseases, there has not yet been a success story in which a compound generated through AI made it to the market for public use. Recently, for the first time ever, a novel target and its novel inhibitor have been proposed through AI-based tools.
Insilico Medicine, a biotechnology company, proposed a novel target involved in idiopathic pulmonary fibrosis and designed its novel inhibitor from scratch using their AI-based tools. The identified small-molecule inhibitor has shown good efficacy in human cells and animal models. In December 2020, Insilico Medicine nominated their small-molecule inhibitor for investigational new drug (IND)-enabling studies, and they are targeting clinical trials by early 2022. If the trials are successful, it will be the first time that a novel target and its inhibitor were proposed through AI-based tools and approved. Though there are some unavoidable obstacles and a tremendous amount of work has to be done to incorporate AI tools into the drug discovery cycle, there is no doubt that in the near future AI will bring revolutionary changes to the drug discovery and development process.