The advancements in technology, coupled with declining instrumentation costs, have increased data generation in both quantity and diversity, leading to numerous data resources [1]. Big data comprises this collection of data of enormous volume and complexity. The drastic increase in data has made it available across varied platforms, in both public and commercial resources [2]. The resulting data-centric environment has mandated the acquisition, integration and analysis of big data to decipher complex medical and scientific problems. Mining this gigantic, complex data to uncover meaningful hidden patterns is equally significant and is referred to as big data analytics [3]. In the modern era, the emergence of big data has revolutionized the processes and strategies used to tackle drug development [4]. It has also facilitated and accelerated the translation of basic research discoveries into clinical practice and transformed conventional drug discovery into a data-driven approach [4,5,6]. The availability of data-rich resources has encouraged the exploitation of artificial intelligence (AI), which mimics human intelligence, to solve multifaceted challenges in the drug discovery process, from the design and identification of novel drug molecules, drug repurposing, testing and clinical trials to personalized medicine [7,8,9,10]. Thus, AI applications related to big data analytics in the pharmaceutical space are attracting sustained interest for making the multipronged, multifaceted drug development process more promising and less time-consuming. However, some hurdles still need to be overcome despite numerous advancements, leaving sufficient room for further data-driven, AI-led innovations [11].

The evolution of big data and artificial intelligence has reformed the strategies adopted to shorten the drug development process. The AI approach has enabled the development of drug candidates in a more structured and economical manner and within a considerably shorter time. Computational resources and algorithms in the drug discovery process utilize existing data to provide better analytics and assessment, from identifying a drug candidate to the pharmaceutical industry’s manufacturing process [11,12,13]. Hence, prior to the synthesis and experimental evaluation of a drug molecule, AI-driven analysis facilitates identifying and screening drug candidates against the desired disease effectively and efficiently.

Presently, AI is a rapidly evolving field that involves various domains, such as reasoning, knowledge representation, and machine learning (ML). Machine learning has been widely implemented for numerous drug discovery applications pertaining to large data sets. It uses various algorithms and techniques to recognize patterns within a given data set [14]. Its primary application in drug design is to identify and exploit the relationship between chemical structures and their biological activities, referred to as the structure–activity relationship (SAR). The advent of massive sequencing approaches like next-generation sequencing (NGS) has resulted in the exponential growth of sequences, opening up numerous putative novel drug targets [15]. ML approaches have contributed significantly to drug target prediction from the available large-scale data sources. ML methods fall under two broad subcategories: supervised and unsupervised learning. The prominent algorithms in drug discovery applications are random forest (RF), support vector machine (SVM), gradient boosted machine with trees (GBM), elastic net regularization (EN), deep learning (DL), and deep neural networks (DNN) [16, 17]. The continuous increase in data and the limitations of ML approaches have led to the emergence of deep learning (DL), a subfield of machine learning that uses the power of artificial neural networks (ANN) [7]. The quantitative structure–activity relationship (QSAR) methods widely used in drug design are regression models that predict the biological activity of chemical compounds. ANN methods are now frequently utilized in the pharmaceutical space for drug design by parameterizing the QSAR model nonlinearly. The basic concept of ANN is to mimic the functioning of electrical impulses generated by neurons in the human brain.
This is achieved by computing units referred to as ‘perceptrons’, which are interconnected like the neurons in the brain and possess self-learning capabilities [18]. The artificial perceptrons in an ANN constitute a set of nodes for data input and output to solve biological problems. ANNs are commonly used in drug discovery to resolve the complexity of screening compounds and to estimate pharmacokinetic and pharmacodynamic parameters [19]. Other types of ANN include multilayer perceptron networks (MLP), recurrent neural networks (RNNs), convolutional neural networks (CNNs) and autoencoders, which use either supervised or unsupervised learning methods [20]. The advancement of ANNs, the deep neural network (DNN), is now gaining attention for its successful application in drug discovery-related areas such as generating novel molecules and predicting the biological activity as well as the absorption, distribution, metabolism, excretion and toxicity (ADMET) properties of drug candidate molecules. Like the ML approach, deep learning has been found effective in building QSAR/QSPR models [21].

In this review, the emphasis is on the role of big data and artificial intelligence in the area of drug design. It attempts to provide a current conceptual framework and “state-of-the-art” snapshot of this domain. Several ML architectures, including supervised and unsupervised methods and their application in small molecule drug discovery, are also emphasized. Various other articles in the public domain have focused either on machine learning [14] or deep learning [22,23,24] methods, while some have discussed the big data resources in drug discovery [10, 11, 23, 25]. However, no single review has so far covered all these aspects of drug design, from the big data resources to an overview and explanation of the development of the implemented algorithms. This review attempts to fill these lacunae and presents in a nutshell how these algorithms were developed and implemented to advance the drug discovery process in the modern AI era. Thus, this review offers insight into the deployment of big data resources in the modern ‘big data’ era by engaging advanced AI algorithms, and provides an integrated, synthesized summary of the current state of knowledge regarding machine learning and big data in drug discovery.

Advent of AI in drug design

Drug discovery is a complex and lengthy venture that requires a multidisciplinary approach. To reach the market, a drug molecule passes through multiple defined stages, each with its own challenges, timeline and cost. Despite numerous advancements in the understanding of biological systems, identifying a novel drug molecule for therapeutic purposes remains a largely lengthy, costly and complicated process [26]. The human genome project (HGP) has facilitated several advancements in drug development, including precision medicine and target identification for a disease. Compared to the traditional approach, both in vitro and in silico methods have a greater propensity to lower drug discovery costs. These computational approaches in the early stages of drug development also shorten the time needed to distinguish a drug candidate with suitable therapeutic effects by excluding compounds exhibiting complex side effects. Modern drug discovery pipelines integrate hierarchical steps engaging various phases such as target identification, target validation, screening of lead candidates against the desired target, and optimization of identified hits to increase affinity, selectivity, metabolic stability, and oral bioavailability. Once a lead molecule is recognized and evaluated, it undergoes preclinical and clinical trials. Finally, the identified molecule that complies with all these investigations moves forward for approval as a drug.

Advancements over time in computational chemistry and high throughput screening (HTS) strategies have enabled the rapid screening of millions of compounds against specific identified drug targets. These techniques produce a large quantity of biological data accumulated in databases and public repositories. The generation of massive data on drugs and drug candidates, driven by technological advancement, has shifted modern drug discovery approaches towards the big data era. Previously, big data analytics was used mainly in information technology, but with large-scale data now available, it is frequently implemented across the engineering and science domains, including drug discovery. Mining this complex and heterogeneous data across many resources is crucial. This has resulted in novel big data-related computational tools and algorithms for its curation and management, and has put forth challenges and opportunities for the research communities [27]. Moreover, advancements in high-performance computing facilities, together with the emergence of artificial intelligence (AI) and machine learning (ML) algorithms, play a prominent part in computer-aided drug design technology to screen and mine lead-like molecules against the desired target more efficaciously, with reduced cost and time (Fig. 1) [19].

Fig. 1
figure 1

Growth of machine learning with the subsequent increase in big data and computation power; KB—Kilobyte, MB—Megabyte, CPU—Central processing unit, GPU—Graphics processing unit, HTS—High throughput screening

Currently, there exist several opportunities to apply both AI and ML associated with big data in drug discovery applications, such as protein folding prediction, protein–protein interaction, virtual screening, QSAR, de novo drug design and drug repurposing. Several approaches like high throughput virtual screening (HTVS), molecular docking, pharmacophore modelling, QSAR and molecular dynamics simulation are widely used for drug discovery [28]. Computer-based drug discovery implements virtual screening (VS) as the primary method to filter novel small molecules from large compound libraries against the desired target for therapeutic effect in the early phase of drug discovery [29]. It also helps to determine novel scaffolds for further optimization of the hit molecules. Computer-based drug discovery can be broadly classified into structure-based drug discovery (SBDD) and ligand-based drug discovery (LBDD). In structure-based drug discovery, the target structure is used to identify a potent drug molecule against a particular disease, whereas ligand-based drug discovery is an effective method based on the structural knowledge of chemical scaffolds to design compounds with improved biological activity. The pharmacophore modelling method is used in both the structure-based and ligand-based approaches, while molecular docking and molecular dynamics (MD) simulation studies are extensively used in structure-based drug discovery. In contrast, scaffold hopping and QSAR are the widely used methods for ligand-based drug discovery [19, 30].

Similar to computer-based drug design, virtual screening methods also fall under two broad categories depending on the available structural information: structure-based virtual screening (SBVS) and ligand-based virtual screening (LBVS). SBVS explores the interaction between the ligand molecule and binding site residues, whereas LBVS uses a chemical similarity approach to identify a drug molecule. Both are integral parts of the drug design method and have their merits and demerits. Structure-based virtual screening is widely used when the structure of the target is known; it exploits the information gleaned from protein–ligand interactions during the docking study, through scoring function analysis, to identify potent drug molecules against the desired target. The ligand-based virtual screening method, in contrast, does not generally rely on the availability of the target structure but on chemical similarity to identify the drug candidate, and hence may be biased towards the reference scaffold. The exponential increase in structural and protein–ligand binding data has necessitated engaging AI methods to deduce these interactions and enable further development of SBVS. ML-based methods such as support vector machine (SVM), random forest (RF) and boosting help to establish the nonlinear dependence of molecular interactions between the ligand and target [31]. Loss of relevant information during feature extraction in ML can be addressed through deep learning (DL)-based approaches. Deep learning methods permit the automatic generation of higher-level hierarchical abstractions from big data that can be used as features, thus reducing the dependency on manual feature engineering in ML.
Another type of DL, the convolutional neural network (CNN), has been notably adapted for virtual screening as it implements feature extraction based on small sections of the input image referred to as receptive fields. DeepVS is a deep learning-based programme that utilizes CNN methodology for screening compounds against the desired target [32]. PTPD, another CNN-based tool, has been developed for designing peptide-based molecules [33].

Ligand-based virtual screening depends on a data set of ligands, classified into active and inactive sets for classification and regression purposes, to predict the activity of compounds. Based on physicochemical analysis and spatial similarities between the active ligands, it identifies and predicts other ligand molecules with higher bioactivities. This method predicts active ligands when the target structure is missing or the structural accuracy of known targets is low. As in structure-based drug design, the adoption of machine learning methods in ligand-based drug design improves the rate of predicted hits by minimizing false hit predictions [34]. With the ever-increasing data size and number of active compounds in the chemical space, the development of ML algorithms has become indispensable for handling big data sets without compromising speed and accuracy. The limitations in addressing large data sets were overcome by the emergence of deep learning (DL) methods, which can manage them efficiently [35]. Deep learning is a sub-branch of machine learning. It centres on neural networks with multiple layers of perceptrons, which help in learning data with multiple levels of abstraction, beneficial for both supervised and unsupervised learning [36]. Recent progress in computational power to comprehend big data and convert it into reusable knowledge has further boosted AI in the drug design process [37]. Popular deep learning libraries such as TensorFlow and PyTorch are widely used to screen big data for drug discovery applications.
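The chemical similarity comparison at the heart of LBVS is often computed as a Tanimoto coefficient over molecular fingerprints. The sketch below, with entirely hypothetical fingerprints represented as sets of ‘on’ bit positions, illustrates how a compound library might be ranked against a query molecule:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Hypothetical fingerprints: sets of bit indices that are set for each molecule
query = {1, 4, 7, 9, 12}
library = {
    "mol_A": {1, 4, 7, 9, 12},   # identical to the query
    "mol_B": {1, 4, 7, 20, 33},  # partial overlap
    "mol_C": {50, 61, 72},       # no overlap
}

# Rank library compounds by similarity to the query, as an LBVS step would
ranked = sorted(library, key=lambda m: tanimoto(query, library[m]), reverse=True)
print(ranked)  # ['mol_A', 'mol_B', 'mol_C']
```

In practice, cheminformatics toolkits compute such fingerprints (e.g. path-based or circular fingerprints) directly from molecular structures; the ranking logic remains the same.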

Big data resources in drug design

Large-scale data exists in diverse forms and data types, which can be raw or processed, standardized or unstandardized. The extraction of meaningful information from this heterogeneous data is a challenging task. The drug discovery process relies on data from several disciplines, such as clinical, bioassay, pharmacological and structural biology data. These data generated from distinct domains and sources encompass a divergent array of large data sets, where artificial intelligence plays a significant role in resolving the complexity present in the data [38]. The continuous growth of big data requires greater computational resources and advanced algorithms to analyse the resulting complex data. The demand for enhanced computational power has resulted in a paradigm shift from personal computers to high-performance computing, cloud computing, and graphical processing units (GPUs) to analyse big data [3]. The accumulated big data utilized for drug discovery can be classified into various categories of databases, such as collections of chemical compounds (e.g. PubChem, ChEMBL), drug/drug-like compounds (e.g. DrugBank, e-Drug3D), collections of drug targets, including genomic and proteomic data (e.g. BindingDB, SuperTarget), and databases containing collections of assay screening, metabolism and efficacy studies (e.g. HMDB, TTD) (Table 1). Over the years, several data-sharing projects have been initiated in parallel with the development of high throughput screening (HTS) techniques [39].

Table 1 Data sources used in drug discovery

Big data is required at different stages of the drug discovery process. The initial step involves screening gigantic libraries of chemical compounds to winnow out probable lead drug candidates. The chemical compound library space is enormous and comprises virtual, designed, and synthesized compounds with descriptions of their properties and distribution, sourced from both public and subscription databases. These data sources are massive and provide a range of multidimensional data for drug discovery and development, including chemical structures, chemical assays, target structures and clinical data. The volume of these data resources is expanding exponentially with time, unlocking avenues to exploit artificial intelligence and machine learning for rapid and effective drug discovery solutions.

Feature/descriptor representation

Most machine learning algorithms cannot use protein sequence information or molecular structure information directly from the databases. Protein sequences and molecular structures first need to be transformed into numerical representations before they can be handled by machine learning algorithms. Protein sequence-based features, such as physicochemical properties, amino acid composition, dipeptide composition, pseudo-amino acid composition (which captures long-range sequence correlation) and amino acid distribution, exploit numerical techniques to convert variable-length protein sequences into fixed-length feature vectors for input to machine learning algorithms. Similarly, numerical features consisting of 1D (molecular weight, etc.), 2D (molecular fingerprints, etc.) and 3D (volume, etc.) descriptors are calculated for molecules to make them suitable for machine learning-based analytics (Table 2). Simplified molecular-input line-entry system (SMILES) strings are among the most commonly utilized molecular representations or notations. With an increase in the dimensionality of the descriptor class, the information content of the descriptors also expands. Several software resources like Open Babel [40], PaDEL [41], Dragon [42], MOE [43], PeptiDesCalculator [44], alvaDesc [45] and QuBiLS-MAS [46] are currently available that can calculate a wide set of different descriptors (0D/1D/2D/3D) from the SMILES format or 2D structure of chemical compounds.

Table 2 Different classes of descriptors with their examples
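As a minimal illustration of the sequence-to-vector conversion described above, the sketch below computes the amino acid composition feature, mapping a protein sequence of any length to a fixed 20-dimensional frequency vector (the toy sequence is hypothetical, not a real protein):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues in a fixed order

def aa_composition(sequence):
    """Convert a variable-length protein sequence into a fixed-length
    20-dimensional feature vector of amino acid frequencies."""
    counts = Counter(sequence.upper())
    n = len(sequence)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

# Toy sequence: the output length is always 20, regardless of input length,
# so the vector can be fed directly to any machine learning algorithm.
vec = aa_composition("MKTAYIAKQR")
print(len(vec))  # 20
```

Dipeptide composition follows the same idea with a 400-dimensional vector (one entry per ordered residue pair); molecular descriptors analogously map a structure to a fixed-length numeric vector.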

Artificial intelligence methods and their role in drug discovery

Artificial intelligence (AI) can explore and sort through available data, recognizing and learning patterns from unstructured/structured input data to extract gainful insights. AI can be classified into different categories such as reasoning and problem solving, knowledge representation, planning and social intelligence, perception, machine learning, robotics and natural language processing (NLP) [47]. General intelligence remains among the long-term goals of AI. The various tools exploited in AI include statistical methods, computational intelligence, optimization, logic and probability-based methods, applied to problems in interdisciplinary areas such as computer science, mathematics, psychology, linguistics, drug discovery, and neuroscience. Speech recognition technology has also been empowered by AI to automate transcription services. In speech recognition, AI converts voice messages into text and supports speaker identification from voice commands.

On the other hand, NLP enables machines to understand natural human language, with tasks categorized into subsets such as classification, machine translation, and text generation based on their utility. Popular, widely adopted examples of NLP are virtual assistants like Google Assistant, Siri and Alexa [48]. Machine learning (ML) and deep learning (DL) are subsets of AI technology and are extensively used for prediction and classification purposes. ML algorithms recognize patterns from the data set for further classification [14]. DL, a subfield of machine learning, deploys artificial neural networks (ANNs) for different tasks. Adopting AI for data-intensive processes has opened up newer possibilities in the drug design space [7]. AI has thus revolutionized and accelerated rational drug design, progressing from machine learning to deep learning in the present big data era.

Artificial intelligence methods: advantages and pitfalls

Machine learning

Machine learning methods can be defined as a set of algorithms that learn from data without human intervention or explicit instructions [71]. Big data has opened immense opportunities for machine learning methods developed specifically to handle the four V’s: Volume, Variety, Velocity and Veracity, and to mine interesting patterns [72]. Big data’s sheer size or volume presents several challenges for traditional machine learning algorithms, such as processing time and memory requirements [73]. The second ‘V’, variety, comprises the different forms/structures of data, which can be unstructured, semi-structured, or structured. Velocity refers to the speed/frequency with which incoming data needs to be processed. Veracity concerns the trustworthiness/reliability of the data. Machine learning algorithms are generally employed for classification and regression tasks. In the former case, the objective is to discriminate between two or more classes (binary and multiclass classification problems); in contrast, regression involves predicting a real-valued quantity or variable [74]. The typical steps for implementing machine learning-based prediction methods consist of data preprocessing, model learning, and evaluation. The data preprocessing step comprises preparing the data for the various machine learning algorithms, for example through discretization and standardization. The model learning phase constitutes the actual implementation of the machine learning algorithms. The final phase involves performance evaluation methods and metrics to assess the numerous trained machine learning models (Fig. 2).

Fig. 2
figure 2

Workflow of machine learning (ML) process in drug discovery
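The three phases above (preprocessing, model learning, evaluation) can be sketched end to end with a deliberately simple model; the two ‘descriptors’ and the activity labels below are synthetic, and a nearest-centroid classifier stands in for the more powerful algorithms discussed in this review:

```python
import random

random.seed(0)

# --- Toy data: 'active' (label 1) vs 'inactive' (label 0) compounds,
# each described by two hypothetical descriptors on very different scales ---
data = [([random.gauss(5.0, 1.0), random.gauss(300, 30)], 1) for _ in range(50)] + \
       [([random.gauss(2.0, 1.0), random.gauss(150, 30)], 0) for _ in range(50)]
random.shuffle(data)
train, test = data[:70], data[70:]

# --- 1. Preprocessing: standardize each descriptor (zero mean, unit variance) ---
def fit_scaler(rows):
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 for c, m in zip(cols, means)]
    return means, stds

def scale(x, means, stds):
    return [(v - m) / s for v, m, s in zip(x, means, stds)]

means, stds = fit_scaler([x for x, _ in train])

# --- 2. Model learning: a nearest-centroid classifier ---
centroids = {}
for label in (0, 1):
    rows = [scale(x, means, stds) for x, y in train if y == label]
    centroids[label] = [sum(c) / len(c) for c in zip(*rows)]

def predict(x):
    z = scale(x, means, stds)
    return min(centroids, key=lambda lb: sum((a - b) ** 2 for a, b in zip(z, centroids[lb])))

# --- 3. Evaluation: accuracy on the held-out test set ---
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)
```

Real pipelines swap in richer descriptors, stronger learners (RF, SVM, DNN) and additional metrics, but the preprocess–learn–evaluate skeleton is the same.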

Big data also presents the challenge of an imbalanced distribution of the available data [75, 76]. A dataset is imbalanced when the instances of one class overwhelm the instances of the other class/classes in sheer number [77, 78]. When the dataset is imbalanced, the accuracy of the learned model shifts towards the majority class, resulting in majority class classifiers [79, 80]: models trained on imbalanced datasets are biased towards predicting the majority class over the minority class (which is often the class of interest). To diminish the effects of imbalanced datasets, two types of approaches are generally undertaken: (i) changes at the algorithm level to make the algorithms suitable for handling imbalanced datasets, and (ii) resampling methods, which are algorithm-independent and comprise different types of sampling. Random undersampling balances the majority and minority classes by randomly removing a percentage of majority class instances. Since it involves random deletion, it can introduce bias and discard unique, informative instances. To mitigate these shortcomings, K-means clustering-based sampling and Kennard–Stone sampling are exploited [80]; other undersampling variants in practice include cluster centroid-based and K-nearest neighbour-based methods [81]. The reverse approach is random oversampling, in which a fixed proportion of minority class samples is randomly replicated; because similar instances are duplicated during balancing, it tends to produce redundant information. These methods are significant for clinical research in drug sampling and drug epidemiology.
SMOTE (synthetic minority oversampling technique) [82] is a nearest neighbour-based method that interpolates a new synthetic minority sample from a predefined number of neighbouring minority samples [77]. SMOTE and its variants, such as borderline-SMOTE and SVM-SMOTE [83], present an effective way of balancing without much bias [84], and combining K-means clustering with SMOTE reduces the bias further.
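The two basic resampling strategies discussed above can be illustrated in a few lines; the class labels and counts here are hypothetical:

```python
import random

random.seed(42)

# Hypothetical imbalanced dataset: 90 'inactive' (majority) vs 10 'active' (minority)
majority = [("inactive", i) for i in range(90)]
minority = [("active", i) for i in range(10)]

# Random undersampling: randomly discard majority instances until balanced.
# Unique majority instances are lost in the process.
under_majority = random.sample(majority, len(minority))
undersampled = under_majority + minority            # 10 + 10 instances

# Random oversampling: randomly replicate minority instances until balanced.
# The same minority instances appear many times, i.e. redundant information.
over_minority = [random.choice(minority) for _ in range(len(majority))]
oversampled = majority + over_minority              # 90 + 90 instances

print(len(undersampled), len(oversampled))  # 20 180
```

SMOTE avoids the duplication problem by interpolating new synthetic points between neighbouring minority samples instead of copying existing ones.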

Deep learning

The rise of deep learning neural networks (DLNN) has revolutionized the analysis of big data. DLNNs have greatly benefitted from the ReLU activation function, which avoids the vanishing gradient problem that plagued earlier neural networks. They consist of an input layer, an output layer and more than two hidden layers in their architecture [85]. As the number of hidden layers increases, the network’s capability to extract more features is enhanced; hence, the complexity of the features that can be extracted is directly proportional to the number of hidden layers. The successful training of DLNNs usually requires vast amounts of data, as the number of parameters is quite large (e.g. every weight associated with each connection between neurons in the network is a parameter). Training with a small amount of data has been observed to result in suboptimally trained networks. Apart from the parameters learned during training, some hyperparameters must also be considered for optimal training of a DLNN [86]. Hyperparameters are crucial as they decide how the network is trained and significantly impact the model’s performance. These are also referred to as ‘tuning parameters’, as some are iteratively fine-tuned using an appropriate algorithm. In a DLNN, the number of layers, the number of neurons per layer and the activation function are some of the common hyperparameters [87]. The optimal hyperparameter setting changes with each dataset, as hyperparameters are tuned per dataset. When DLNNs are trained with a stochastic gradient descent algorithm, the network weights are updated depending on the learning rate (a hyperparameter). A large learning rate results in faster training of the model but may lead to suboptimal solutions, while a smaller learning rate results in slow training of the network.
A suitable learning rate yields the best approximate solution within the predefined number of training epochs. To obtain an optimal set of hyperparameters, grid search or random search is often used. Overfitting occurs when the learning algorithm learns the minute details of the dataset instead of generalizing. The accuracy of DLNNs can be improved by employing regularization penalties, such as L1 (lasso regression) and L2 (ridge regression), which help to avoid overfitting. Regularization imposes a higher penalty on complex models than on simpler models, but not at the cost of reduced predictive performance. L1 regularization adds the absolute values of the coefficients to the loss function; it shrinks the coefficients of less significant features to zero and thereby facilitates feature selection, since features with zero coefficients can be removed from the model.

Loss function with L1 regularization can be given by Eq. (1)

$$ {\text{loss}} = {\text{error}}\left( {y,\hat{y}} \right) + \lambda \mathop \sum \limits_{{i = 1}}^{n} \left| {\beta _{i} } \right| $$

where y = true value; \(\hat{y}\) = predicted value; error(y, \(\hat{y}\)) = unregularized model error; λ = parameter governing the magnitude of penalty applicable to the model; n = number of features; βi = model coefficient.

In contrast, L2 regularization utilizes the squared magnitudes of the coefficients and shrinks them evenly. It prevents overfitting and is especially useful when collinear features are present.

Loss function with L2 regularization can be given by Eq. (2)

$$ {\text{loss}} = {\text{error}}\left( {y,\hat{y}} \right) + \lambda \mathop \sum \limits_{{i = 1}}^{n} \beta _{i}^{2} $$

where y = true value; \(\hat{y}\) = predicted value; error(y, \(\hat{y}\)) = unregularized model error; λ = parameter governing the magnitude of penalty applicable to the model; n = number of features; βi = model coefficient.
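Assuming a sum-of-squared-residuals error term, Eqs. (1) and (2) can be computed directly; the data and coefficients below are illustrative only:

```python
def l1_loss(y_true, y_pred, coefs, lam):
    """Squared-error loss with an L1 (lasso) penalty, as in Eq. (1);
    the error term is assumed here to be the sum of squared residuals."""
    error = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    return error + lam * sum(abs(b) for b in coefs)

def l2_loss(y_true, y_pred, coefs, lam):
    """Squared-error loss with an L2 (ridge) penalty, as in Eq. (2)."""
    error = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    return error + lam * sum(b ** 2 for b in coefs)

y_true, y_pred = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
coefs = [0.5, -2.0, 0.0]  # note the zero coefficient: L1 tends to produce these

print(l1_loss(y_true, y_pred, coefs, lam=0.1))  # error + 0.1 * (0.5 + 2.0 + 0.0)
print(l2_loss(y_true, y_pred, coefs, lam=0.1))  # error + 0.1 * (0.25 + 4.0 + 0.0)
```

Note how the L1 penalty grows linearly with each coefficient while the L2 penalty grows quadratically, which is why L1 drives small coefficients exactly to zero and L2 merely shrinks them.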

Dropout has also proved to be an important technique for reducing overfitting [88]. Dropout involves randomly deactivating a specified percentage of neurons and their connections in different layers of the deep network during training. This makes the network more robust to memorization and increases generalization (Fig. 3).

Fig. 3
figure 3

a Deep learning neural network (DLNN) without dropout b Deep learning neural network (DLNN) with dropout
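A minimal sketch of dropout applied to one layer’s activations, using the common ‘inverted dropout’ formulation; the activation values are hypothetical:

```python
import random

random.seed(7)

def dropout(activations, rate):
    """Randomly zero a fraction `rate` of activations during training and
    rescale the survivors by 1/(1 - rate), so the expected sum is preserved."""
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

layer_output = [0.5, 1.2, -0.3, 0.8, 2.1, -1.7, 0.4, 0.9]
dropped = dropout(layer_output, rate=0.5)
print(dropped)  # roughly half the activations are zeroed, the rest doubled
```

At inference time dropout is switched off and the full layer output is used; the rescaling during training keeps the two regimes consistent.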

Deep learning variants

Generative adversarial networks (GANs) are a combination of two competing neural networks: a generative network and a discriminator network. The purpose of the discriminator network is to classify and distinguish real data from fake data. The generative network produces the fake data using feedback from the discriminator, which is trained on real labelled data (i.e. consisting of class information). The iterative procedure, in which the generative network optimizes the fake data to resemble the real data and the discriminative network attempts to discriminate between them, continues until a local Nash equilibrium is attained, at which point there is no further reduction in the cost of either the generator or the discriminator [89]. Many novel applications of GANs in cheminformatics and computer-aided drug design have emerged recently [90]. Modifications of GANs, such as the conditional GAN [91] and Wasserstein GAN [92], have proved very useful in tasks such as novel molecule design (Fig. 4) [93, 94] and the optimization of molecules with desired properties [95, 96].

Fig. 4
figure 4

De novo chemical design using generative adversarial networks (GANs)
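The two competing objectives can be made concrete with a toy one-dimensional example; the ‘generator’ and ‘discriminator’ below are fixed linear/logistic functions rather than trained networks, so this only illustrates how the adversarial losses are computed, not the training loop itself:

```python
import math
import random

random.seed(0)

def discriminator(x, w=1.5, b=-1.0):
    """Toy 1-D discriminator: probability that sample x is 'real'."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def generator(z, w=0.8, b=0.2):
    """Toy 1-D generator: maps noise z to a fake sample."""
    return w * z + b

real = [random.gauss(2.0, 0.5) for _ in range(100)]         # 'real' data
fake = [generator(random.gauss(0, 1)) for _ in range(100)]  # generated data

# Discriminator objective: maximize log D(real) + log(1 - D(fake)),
# i.e. minimize the negated average below
d_loss = -sum(math.log(discriminator(x)) for x in real) / len(real) \
         - sum(math.log(1 - discriminator(x)) for x in fake) / len(fake)

# Generator objective (non-saturating form): maximize log D(fake)
g_loss = -sum(math.log(discriminator(x)) for x in fake) / len(fake)

print(d_loss, g_loss)
```

In an actual GAN, both functions are neural networks and gradient updates alternate between the two losses until neither side can improve, i.e. the Nash equilibrium described above.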

Convolutional neural networks (CNNs) (such as VGGNet and VGG19) [97] are variants of DLNNs mainly used for computer vision and image classification. CNNs consist of three components: the convolution layer, the pooling layer and the fully connected layer. The convolution layer recognizes features such as the colours and edges of an image, generating activation maps. The pooling layer reduces the spatial dimension of the activation maps, and the fully connected network executes the image classification. Other CNN variants such as Inception [98] and ResNet are considered state of the art in computer vision/image classification [99]. High-accuracy CNN models have been implemented for the diagnosis of diseases such as cancer [100]. Recently, CNNs have been trained for mining protein–ligand interactions [101, 102], text mining [103] and predicting the toxicity of compounds from their graphic images [104]. Recurrent neural networks (RNNs) [105] can model sequential information; long short-term memory (LSTM) units are primarily used for constructing RNNs [106]. They can also be used for generative purposes [35, 107]. The concept of multitask learning [108, 109] involves training a learning algorithm on several similar tasks rather than a single task, and has proved very effective in cheminformatics, for example in toxicity prediction [110, 111].
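The convolution and pooling operations can be sketched in plain Python (CNN libraries actually compute cross-correlation, as done here); the tiny ‘image’ and edge-detecting kernel are illustrative:

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling, roughly halving each spatial dimension."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# A 6x6 'image' with a bright vertical edge down the middle
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
edge_kernel = [[-1, 1]]  # responds to left-to-right intensity increases

fmap = conv2d(image, edge_kernel)  # 6x5 activation map, peaks at the edge
pooled = max_pool2d(fmap)          # 3x2 after pooling
print(pooled)  # [[0, 9], [0, 9], [0, 9]]
```

The activation map responds only where the edge lies, and pooling preserves that response while shrinking the spatial dimensions; a fully connected layer would then classify the pooled features.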

Hybrid approaches like LSTM-GAN (long short-term memory–generative adversarial network), DCGAN (deep convolutional generative adversarial network), gcWGAN (guided conditional Wasserstein GAN), which are constructed using different deep learning paradigms, have been successfully used in de novo protein design [112, 113].

One major drawback of DLNNs is that they act as black boxes and do not explain their decision-making/classification process. Recently, to mitigate this, VIP (variable importance) charts and SHAP (SHapley Additive exPlanations) plots were introduced, which have lessened the black-box nature of DLNNs to some extent. SHAP is based on game theory and has been adopted mainly to deduce the importance of individual features and their distribution over the target variable [114, 115]. Platforms like H2O, TensorFlow [116] and Keras are being used to train DLNNs with big data. Traditional visualization methods may not be optimal for these enormous datasets, whereas newer methods such as t-SNE can be exploited readily [117].
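As a hedged illustration of the goal SHAP serves, i.e. attributing a trained model’s behaviour to individual input features, the sketch below uses scikit-learn’s model-agnostic permutation importance instead of the SHAP library itself (the “molecular descriptor” framing and the synthetic data are assumptions made for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Illustrative feature-attribution sketch: only descriptor 0 of three
# hypothetical molecular descriptors drives the "activity" label, so
# permuting it should degrade the model most.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(int(result.importances_mean.argmax()))  # the informative descriptor: 0
```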


Autoencoders are unsupervised neural networks trained to reproduce the input (as a reconstructed input) at their output nodes (Fig. 5). In the hidden layers, autoencoders transform the input into hierarchical higher-order representations (Fig. 6). Because different attributes/features of the data present different facets, it is not known a priori which feature/attribute will result in better training for a machine learning algorithm [118]. These higher-order representations can therefore be used as features/attributes in the training of learning algorithms. Autoencoders are mainly used for dimensionality reduction and anomaly detection [119]. In drug discovery, they have mainly been applied to dimensionality reduction of features for drug–target interaction prediction [120], initialization of model parameters [121] and assessing drug similarities [122].

Fig. 5

Representation of an autoencoder. The green circles represent the hidden layer

Fig. 6

A deep autoencoder with hidden layers. The hierarchical representations from the hidden layers can be used as features in the training of learning algorithms
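A minimal autoencoder sketch, assuming scikit-learn: an MLP with a narrow 2-unit hidden layer is trained to reproduce its own 8-dimensional input, so the bottleneck learns a compressed representation (the rank-2 synthetic data is an assumption for illustration, chosen so that a 2-unit code can succeed).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Minimal linear autoencoder: the network is trained to reproduce its
# 8-dimensional input through a 2-unit hidden bottleneck. The synthetic
# data is driven by 2 latent factors, so a 2-unit code suffices.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 8))   # 8 observed features, rank 2

ae = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                  max_iter=5000, tol=1e-6, random_state=0)
ae.fit(X, X)                           # target equals the input
recon_error = float(np.mean((ae.predict(X) - X) ** 2))
print(f"reconstruction MSE: {recon_error:.3f}")
```

The hidden-layer activations (here, the 2-unit code) are what would be reused as features for a downstream learning algorithm.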

Ensemble learning

Several different classifiers can also be applied together for the final classification and regression tasks. The ensemble learning approach [123] utilizes many different base classifiers in the initial phase and fuses their decisions in the final stage. This provides a critical advantage, as each base classifier’s deficiency can be compensated by the other base classifiers. Stacking, StackingC and voted ensemble classifiers are most commonly used to construct ensemble classification systems [124]. In stacking, the first step involves training different base classifiers, and the second step combines the outputs of the base classifiers using a metaclassifier (Fig. 7). Voted ensemble classifiers can be constructed using the popular majority voting scheme, in which the class is predicted from the votes of the different base classifiers; they can also be implemented using the average vote rule, the maximum and minimum probability rules, and the product probability rule [125]. The concept of ensemble learning has also been applied to regression problems, where a real-numbered value (target variable) is predicted instead of a discrete class [126]. Two prominent ensemble learning algorithms, Bagging and AdaBoost, have been employed for QSAR modelling.

Fig. 7

Schematic representation of stacking ensemble approach
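The stacking and voting schemes above can be sketched with scikit-learn (the base classifiers and the synthetic classification task are illustrative choices, not prescriptions from the text).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an "active vs. inactive" classification task
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
base = [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0))]

# Stacking: a logistic-regression metaclassifier fuses the base outputs
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression())
# Voting: majority vote over the base classifiers' predictions
vote = VotingClassifier(estimators=base, voting="hard")

scores = {}
for name, clf in [("stacking", stack), ("voting", vote)]:
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(name, round(scores[name], 3))
```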

Deep belief networks

Deep belief networks (DBNs) are generative graphical deep learning networks [13] that consist of restricted Boltzmann machines or autoencoders and are characterized by the absence of connections between units in the same layer. They can be trained in both supervised and unsupervised manners [127]. They have found definitive applications in virtual screening [128], multilabel classification of multi-target drugs [129] and the classification of small molecules into drugs and non-drugs [130].

Performance evaluation metrics

Machine learning algorithms have to be assessed critically for their performance. Evaluation metrics such as accuracy, sensitivity, specificity and G-means are commonly used; they are calculated from the four quadrants of a confusion matrix (TP: true positives, TN: true negatives, FP: false positives, FN: false negatives) [131]. Regression models are evaluated mainly by the mean absolute error, mean squared error and root-mean-squared error. The various evaluation parameters for measuring the performance of machine learning algorithms include:

Accuracy: This is the fraction of correct predictions out of the total number of samples, as shown in Eq. (3).

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$

Sensitivity: It is the fraction of the positive class that is correctly predicted, represented by Eq. (4).

$$ {\text{Sensitivity}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} $$

Specificity: It is the fraction of the negative class that is correctly predicted (Eq. (5)).

$$ {\text{Specificity}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}} $$

G-means: It is a very useful metric to gauge the machine learning model’s performance in class imbalance scenarios, as shown in Eq. (6).

$$ g{\text{-means}} = \sqrt {{\text{Sensitivity}} \times {\text{Specificity}}} $$

F-score: It is also known as the F1 score and is defined as the harmonic mean of precision and recall, as given in Eq. (7).

$$ F~{\text{score}} = 2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}} $$

Cohen’s Kappa (K): It is a quantitative measure of the reliability of two classifiers classifying the same items and quantifies the agreement between their classification outcomes (Eq. (8)). A score of 0 means the agreement is due to chance alone, a score of 1 means complete agreement, and a score below 0 means less agreement than expected by chance alone.

$$ K = \frac{{P_{o} - P_{e} }}{{1 - P_{e} }} $$

where \(P_{o}\): observed agreement; \(P_{e}\): the expected probability of chance agreement.

Mean absolute error (MAE): It is defined as the average of the absolute differences between the actual target values and the values predicted by the trained model. It is represented as Eq. (9).

$$ {\text{MAE}} = \frac{1}{n}\sum \left| {Y - \mathop Y\limits^{ \wedge } } \right| $$

where n: number of samples; Y: actual target value; \(\mathop Y\limits^{ \wedge }\) : predicted target value.

Mean squared error (MSE): It is defined as the average of the squared differences between the actual target values and the predicted target values (Eq. (10)).

$$ {\text{MSE}} = \frac{1}{n}\sum (Y - \mathop Y\limits^{ \wedge } )^{2} $$

where n: number of samples; Y: actual target value; \(\mathop Y\limits^{ \wedge }\): predicted target value.

Root-mean-squared error (RMSE): It is defined as the square root of the average of the squared differences between the actual target values and the predicted target values. RMSE is used in cases where large errors are to be penalized as represented by Eq. (11).

$$ {\text{RMSE}} = \sqrt {\frac{1}{n}\sum (Y - \mathop Y\limits^{ \wedge } )^{2} } $$

where n: number of samples; Y: actual target value; \(\mathop Y\limits^{ \wedge }\): predicted target value.
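The metrics above can be verified on a small worked example. The confusion-matrix counts and the regression values below are invented purely for illustration.

```python
import numpy as np

# Worked example of the metrics above, using invented confusion-matrix
# counts (TP=40, TN=30, FP=10, FN=20) and a tiny regression example.
TP, TN, FP, FN = 40, 30, 10, 20

accuracy    = (TP + TN) / (TP + TN + FP + FN)         # Eq. (3)
sensitivity = TP / (TP + FN)                          # Eq. (4), recall
specificity = TN / (TN + FP)                          # Eq. (5)
g_means     = np.sqrt(sensitivity * specificity)      # Eq. (6)
precision   = TP / (TP + FP)
f_score     = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (7)

# Cohen's kappa, Eq. (8), from the equivalent label vectors
y_true = np.array([1] * 60 + [0] * 40)
y_pred = np.array([1] * 40 + [0] * 20 + [1] * 10 + [0] * 30)
p_o = np.mean(y_true == y_pred)                        # observed agreement
p_e = (np.mean(y_true == 1) * np.mean(y_pred == 1)
       + np.mean(y_true == 0) * np.mean(y_pred == 0))  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)

# Regression errors, Eqs. (9)-(11)
Y, Y_hat = np.array([3.0, 5.0, 2.5]), np.array([2.5, 5.0, 4.0])
mae  = np.mean(np.abs(Y - Y_hat))
mse  = np.mean((Y - Y_hat) ** 2)
rmse = np.sqrt(mse)

print(round(accuracy, 2), round(g_means, 2), round(kappa, 2), round(rmse, 2))
# 0.7 0.71 0.4 0.91
```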

Applications in drug discovery

Computer resources are essential for effective AI execution. Thus, the rise of high-performance computing clusters, advances in graphics processing unit (GPU) power, cloud-based resources and the accumulation of massive chemical informatics data have further augmented the evolution of artificial intelligence (AI) technology [132]. This technology has upended the drug discovery paradigm and completely transformed the work culture of the pharmaceutical space. AI capitalizes on predictive hypotheses drawn from the available large data sets, in contrast to the traditional trial-and-error approach to drug discovery [1]. Currently, the R&D sectors of renowned pharmaceutical companies such as Pfizer, GlaxoSmithKline, Novartis, Merck, Sanofi, Genentech and Takeda are adopting machine learning and artificial intelligence to manage the enormous generated data and deliver cost-effective solutions. It is projected that the market for AI-based drug discovery will reach $1.43bn by 2024, with an annual growth rate of 40.8%. The increase in the number of cross-industry collaborations and partnerships to control escalating drug discovery costs is a major factor responsible for the rise of the AI market in drug discovery and development [24, 133].

The drug discovery approach encompasses various steps from target identification to the clinical phase. The recent breakthrough in AI technology and its incorporation has benefitted the various phases of drug discovery and the pharmaceutical industry. This technology provides innovative solutions in all aspects of the multifaceted drug discovery process such as, in the identification of drug targets, screening of lead compounds from data libraries, drug repurposing, predicting the toxicity of compounds, predicting bioactivity of compounds, de novo design and in automation of compound synthesis [134,135,136]. The different areas where AI has significantly contributed to the various stages of drug design are shown in Fig. 8.

Fig. 8

Role of AI technology in different phases of drug discovery

In structure-based drug discovery, a target structure is essential for the successful design of a drug molecule. Homology modelling and de novo protein design are the traditional methods for structure modelling. The emergence of AI technology has contributed enormously to predicting the 3D structure of proteins as well as to determining the effect of a compound on the designed target. Recurrent neural network (RNN) and deep neural network algorithms are widely exploited in target modelling studies. AlphaFold, an AI tool that relies on DNNs, is widely used to predict a protein’s 3D structure from its primary sequence [137]. The feature extraction potential of deep learning makes it a promising method for predicting the secondary structure, backbone torsion angles and residue contacts of a protein. Thus, protein folding can be studied from sequence alone with the help of AI methods [138, 139]. DN-Fold, another deep learning network method, is widely used for fold recognition and can efficiently predict the structural fold of a protein [140]. With the growth of protein sequence data, AI methods also contribute significantly to protein–protein interaction prediction through the DNN-based DeepPPI, which outperforms (prediction accuracy 80.82%) the traditional ML-based approach (prediction accuracy 65.80%), as the latter is hampered by manual feature extraction [141].

Apart from protein modelling, AI has a role in drug screening, where it reduces the time needed to identify a drug-like compound. ML algorithms such as nearest-neighbour classifiers, RF, extreme learning machines, SVMs and DNNs are used for the virtual screening of drug molecules and for assessing their synthetic feasibility. ML-based drug screening has been successfully applied to identify drug-like molecules against various diseases such as cancer and neurodegenerative disorders [142,143,144]. The incorporation of AI has opened up newer avenues and transformed the drug discovery process. AI and ML implementation has guided the exploration of low-molecular-weight compounds for their therapeutic potential. Zhavoronkov et al. performed a deep learning analysis to discover novel inhibitors of the enzyme DDR1 kinase [145]. McCloskey et al. employed ML models such as graph CNNs and RF to identify novel small drug-like molecules against three different proteins [146]. Small molecules were predicted against rheumatoid arthritis using an integrated ML and DL approach [147]. Another study using an AI-based method identified the hepatotoxic ingredients of traditional Chinese medicines [148]. Predictive models have been developed using ML algorithms to screen for drug-induced liver toxicity [110]. This technology has also contributed during the current pandemic to the recognition of drug-like molecules against different SARS-CoV-2 targets. Numerous studies have been performed to identify potent lead molecules against the novel coronavirus using traditional medicine. Xu et al. used ML and molecular modelling to identify inhibitors of the 3CL proteinase [149]. The deep learning approach has also assisted in the identification of potential drug targets for SARS-CoV-2 [150]. Studies have also encompassed drug repurposing approaches against targets of the novel coronavirus using AI methods [151, 152]. The DL-based platform DeepDTA has been deployed on marketed antiviral drugs to predict possible therapeutic agents against COVID-19 [153].
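A hedged sketch of ML-based virtual screening: a random forest is trained on known actives and inactives, then used to rank an unscreened library by predicted activity (the descriptor vectors and the “activity rule” are synthetic stand-ins for real molecular fingerprints and assay labels).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative virtual screening: train on labelled compounds, then
# prioritise the highest-scoring members of an unscreened library.
rng = np.random.default_rng(0)
X_known = rng.normal(size=(500, 16))                      # descriptor vectors
y_known = (X_known[:, :4].sum(axis=1) > 0).astype(int)    # toy activity rule

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_known, y_known)

library = rng.normal(size=(1000, 16))                     # virtual library
scores = model.predict_proba(library)[:, 1]               # predicted activity
top_hits = np.argsort(scores)[::-1][:10]                  # top 10 candidates
print("top-10 candidate indices:", top_hits.tolist())
```

In a real campaign the top-ranked compounds would then go forward to docking or experimental assays, which is where the time saving over exhaustive screening comes from.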

Pharmaceutical companies such as Bayer, Roche and Pfizer have collaborated with IT companies to develop AI-enabled platforms for therapeutics discovery in areas such as immuno-oncology and cardiovascular diseases [154]. Apart from drug screening, AI has considerably improved the scoring functions of docking methods used to evaluate a drug molecule’s binding affinity towards the target. ML-based approaches such as RF and SVM have aided the development of scoring functions by effectively extracting geometric, chemical and physical force field features. Owing to the advances of deep learning methods in image processing, CNNs have been incorporated successfully to extract features from protein–ligand images and predict protein–ligand binding affinity [155, 156]. DeepVS, deep learning-based software for molecular docking studies, is extensively employed over traditional docking programmes on account of its scoring functions [32].

After identifying hit or lead molecules in the drug discovery pipeline, a series of tests and evaluation studies is executed to assess the physicochemical and toxicity properties of the candidate drug molecule. Early identification and weeding out of drug candidates with poor physical and chemical properties reduce the failure rate during the drug discovery process [157]. AI-based methods aid the execution of this process in a time-efficient manner, effectively predicting the physicochemical properties of compounds from large datasets [158, 159]. Both ML and deep learning-based algorithms are employed in this process. Various tools based on CNNs, deep neural networks and RF are available, namely TargeTox [160], DeepTox [110], DeepNeuralnetQSAR [161], eToxPred [162], DeepDTA [163], GraphDTA [164] and DeepAffinity [165], which afford the prediction of the toxicity and physicochemical properties of compounds from large compound libraries.

AI-based methods are comparatively more effective and are now widely used in de novo drug design and compound synthesis automation [166]. Established automated techniques such as solid-phase synthesis are currently used to synthesize several classes of compounds, including peptides and oligonucleotides. These methods have suffered from the lack of standardized digital automation to control chemical reactions, owing to the absence of a suitable universal programming language. With the advancement of AI methodology, the deep learning approach, with its powerful learning capabilities, has been incorporated to generate new chemical entities. Deep neural networks (DNNs), reinforcement learning (RL), variational autoencoders (VAEs) and multilayer perceptrons (MLPs) are currently adopted for de novo drug design and automation [167, 168]. Chemputer is a recently developed platform that provides a detailed recipe for molecule synthesis and has been exercised in compound synthesis automation. The synthesis of three pharmaceutical compounds, diphenhydramine hydrochloride, rufinamide and sildenafil, has been successfully automated through this method [169, 170]. The purity and yield of the synthesized compounds were comparable with or better than those of manual synthesis. Thus, AI has moved forth in the pharmaceutical industry to automate and upscale bench chemistry, with an edge in the safety, efficacy and accessibility of the identified complex molecules.

AI has also contributed immensely to the various steps involved in clinical trial research. It can be deployed for remote surveillance to access real-time data with increased efficacy. AI can assist in decision-making for patient recruitment from a defined cohort, in replanning a patient’s treatment regimen by monitoring the patient’s response to a drug, and in determining the patient dropout rate and the final efficacy of the drug [171]. BioXcel Therapeutics has successfully identified BXCL701, a candidate molecule discovered using AI technology that is effective against schizophrenia and bipolar disorder. BXCL701 is also currently in different phases of clinical trials against pancreatic cancer, for which it has obtained FDA approval [172]. Thus, conventional drug discovery concepts combined with advanced computational approaches provide an excellent platform for research and development to enhance the drug discovery and development process.

Available AI computational tools for drug design

The power of computer software in drug design is evident from the initial stages of drug discovery. Advancements in software and its availability open new opportunities for application in research and learning. Open-source software has gained popularity due to its easy availability and accessibility. Many researchers also share their programmes on GitHub and other platforms to accelerate the drug discovery process and permit widespread use of these AI resources (Table 3). Several open-source deep learning frameworks are also available to users, such as TensorFlow, PyTorch, Keras, scikit-learn, MXNet, Gluon, Swift, Chainer and ONNX. These frameworks exploit high-performance computing resources across various platforms, including CPUs, GPUs and tensor processing units (TPUs) [173]. The inbuilt libraries are based on the deep learning framework and are applicable in multiple areas of science and technology, including health care. TensorFlow, PyTorch, Keras and scikit-learn are Python-based libraries widely used in drug discovery where large datasets are present. TensorFlow (TF) is a framework from Google that can be utilized to develop models that predict the molecular activity of a compound dataset using the deep learning approach. Keras is a high-level interface that runs on top of TensorFlow and is user-friendly and easy to debug. PyTorch is also an open-source project used to define and train models that give insight into the complex links between drugs and accelerate the drug discovery process [174]. Scikit-learn presents an open-source, user-friendly platform for classification, regression and dimensionality reduction. Software such as Weka is also available and is extensively utilized for machine learning-based applications in drug discovery, classification and clustering.
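As a minimal example of the classification and dimensionality reduction tasks mentioned for scikit-learn, the sketch below chains scaling, PCA and logistic regression in a single pipeline on synthetic data (the dataset and component counts are illustrative assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a descriptor matrix with a binary activity label
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pipeline: scale features, reduce to 5 components, then classify
pipe = Pipeline([("scale", StandardScaler()),
                 ("pca", PCA(n_components=5)),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_tr, y_tr)
print(f"held-out accuracy: {pipe.score(X_te, y_te):.2f}")
```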

Table 3 AI computational tools for drug design

Apart from the freely available resources, some companies, namely Janssen, AstraZeneca, Novartis and Sanofi, are currently exploring the potential of AI technology in the healthcare sector. They have collaborated with software and data science companies, namely IBM Watson, Microsoft, PointR Data, Numerate, BenevolentAI and Atomwise, which provide support and cloud-based/server infrastructure to implement AI according to their requirements for drug discovery research against various diseases (Table 4).

Table 4 Collaborations of AI organization with pharmaceutical companies

Challenges and future perspectives

The advent of faster and lower-cost technology, coupled with developments in computing power, has accelerated the pace of data generation, leading to several enormous compound data sources. This has mandated implementing numerous artificial intelligence and machine learning approaches at various drug discovery stages to mine pharmaceutical knowledge from large-scale ‘big’ data. The knowledge gleaned from applying these AI algorithms to big data has provided a stimulus for designing and discovering novel molecules and for their further optimization. This technology has helped push the drug discovery process forward by automating and customizing it, affirming big data’s significance. The impact of artificial intelligence is growing steadily in the academic sector and in pharmaceutical companies, concomitantly with a surge of startups and AI-based R&D companies. Compared to the traditional high-throughput screening methodology, an AI-based computational pipeline can screen virtual compound libraries rapidly to identify preclinical candidates. Besides drug screening, AI tools can be witnessed at different stages of the drug discovery cycle, such as predicting the physical properties, bioactivity, toxicity and ADME properties of molecules, protein structure prediction, and patient recruitment and surveillance.

Apart from the varied applications of AI-based technology, some limitations and challenges still need to be overcome. The triumph of AI-based technology relies on the ease and frequency of data availability to users. The multiple ‘V’ features of big data, such as volume, velocity, variety and volatility, require improved data curation and management and user-friendly web portals. Thus, reliable, high-quality curated data is essential to glean insightful information. Though AI technology is slowly revolutionizing the drug discovery process through accelerated drug design methods and lower failure rates, the lack of adequately curated and accessible data can prove to be a hurdle. Other rate-limiting steps include the difficulty of constantly and expeditiously updating the available software to match the formats of newly generated data and recently developed algorithms. Additionally, skilled personnel for the full-fledged operation of AI-based applications in drug discovery are not readily available. Despite the advances and popularity of machine learning approaches, some aspects still remain to be extensively explored, such as predicting conformational changes in proteins and the binding affinity between a drug molecule and its target. Since deep learning requires massive data, the technique is limited by the extent and quality of the available data; thus, the rapid development of transfer learning technology can be a better approach to solving this problem. Although these advanced approaches have displayed high prediction accuracy and performance, deep learning still works as a “black box”, and its mechanism for solving a problem remains unclear. Moreover, though AI technology and gigantic data sources have contributed enormously to speeding up the drug design pipeline, experiments still need to be conducted before drugs can be approved.
Regardless of the limitations, AI has changed the landscape of drug discovery, and with its surging demand, it will soon become an essential, integral tool in the search for novel drugs and their targets and the pharmaceutical sector in the not too distant future.