Background

The human ether-à-go-go-related gene (hERG) encodes a voltage-dependent ion channel (Kv11.1, hERG) involved in controlling the electrical activity of the heart by mediating the re-polarisation current in the cardiac action potential [1, 2]. Malfunction or inhibition of hERG-channel activity by drug molecules can lead to cardiac arrhythmias in the form of prolonged QT intervals and may lead to sudden cardiac arrest. Therefore, unwanted drug-induced arrhythmias are great concern for pharmaceutical companies and have led to blockbuster drugs being withdrawn from the market and discontinuation of drugs in late stages of development [3]. To prevent new drugs with unwanted hERG-related cardiotoxicity to enter the market, guidelines for assessment of potential for QT interval prolongation by non-cardiovascular medicinal products were decided at the International Conference on Harmonization of Technical Requirements for the Registration of Pharmaceuticals for Human Use (ICH) [4, 5]. These procedures are time-consuming and expensive and therefore, to prevent product depletion due to cardiotoxicity at late preclinical and clinical stages, there is focus on preventing drugs with hERG channel activity from entering drug discovery pipelines in the first instance. To avoid this, computational methods to predict hERG liability have been established and can help prioritise molecules during the early phase of drug development [4]. Most of these methods are based on either machine learning techniques, including random forest (RF), support vector machine (SVM), deep neural networks (DNN) and graph convolutional neural networks (GCN) or on structure based methods including pharmacophore searching, quantitative structure activity relationships (QSAR) and molecular docking [6,7,8,9,10]. Publicly available high quality datasets consisting of molecules classified as hERG and non-hERG blockers are available and often utilized by these computational tools [6, 8, 11]. The datasets annotate chemical structure by SMILES strings which is a chemical language that describes the chemical structure using ASCII character strings. The SMILES strings are readable by expert chemists and are considered a low-level representation of molecular structure [12]. For ease of computational processing, chemical structure is encoded using a fragmentation scheme into binary vectors of fixed length called fingerprints which is another low level representation [13, 14]. Similarly, high level features such as 2D and 3D physicochemical descriptors can be computed from SMILES strings which are then used in various machine learning models [8, 15]. Alternatively, molecular graph representations have been used with graph convolutional neural networks [16]. This intermediate level molecular graph representation offers a compromise between high level physicochemical features and low level SMILES and fingerprints [17]. Under this category, each molecule can be represented via a molecular graph which consists of node features and an adjacency matrix.

Models in most of these previous studies utilize single type of features such physicochemical, fingerprints or graph features which restricts the model performance and its robustness [6, 8, 11, 18]. For instance, CardPred used a total of 3456 physicochemical descriptors and fingerprints with six individual machine learning models [8] to achieve reasonable performance when evaluated against accuracy (ACC) and positive predictive value (PPV) but performed poorly when evaluated against other metrics such as Matthew correlation coefficient (MCC), negative predictive value (NPV), specificity (SPE), sensitivity (SEN) (evaluated on external test sets as reported in the results section) [19]. A method reported by Cai et al. [6] relies on physicochemical descriptors and molecular vectors combined together as a single input for a fully connected multi-task deep neural network to achieve better performance for various metrics except NPV (for their internal cross validation datasets). Li et al. [11] used 8 different types of machine learning models and their ensemble with physicochemical descriptors and fingerprints performed well when evaluated against SPE and PPV but less so for other metrics. The key to success for these previous methods for hERG activity prediction is elucidating correct structure-property relationships from existing data using high level physicochemical features along with fingerprints. Recently the DeepHIT method was introduced which utilizes physicochemical descriptors, fingerprints and graph features with fully connected deep neural networks and graph convolution neural networks to achieving better performance for hERG activity prediction [19]. DeepHIT classifies a molecule as a hERG blocker if at least one model out of the three models used predicts a given molecule as a hERG blocker [19], thus enhancing the sensitivity of the model. Although DeepHIT utilize reasonably diverse feature set, it still lacks in an effective way of combining the outputs of individual models for robust performance over a range of metrics. There is also substantial literature for combining various types of features and features selection for molecular activity prediction, but no clear winner is concluded as yet because performance depends on the characteristics of the molecules used for modeling [20]. In several cases though, it was observed that the accuracy of the models can be improved by feature aggregation because of complementary information [20,21,22,23].

We hypothesize that extraction of chemical information from all or the subsets of three levels of features (low, high and intermediate) and their variants can improve upon the performance over a wide range of accuracy metrics for molecular hERG activity prediction For this purpose, we propose a step-wise training based deep learning framework called CardioTox net, that improves upon the previously published best-in-class results in most of the performance metrics. For three different external test sets, CardioTox net improves Matthew correlation coefficient with a value of (0.599, 0.452, 0.220), accuracy (0.810, 0.755, 0.746), positive predictive value (0.893, 0.455, 0.113) and specificity (0.786, 0.600, 0.698) while keeping the sensitivity same as so far the second best in class method, DeepHIT. Our framework consists of three stages; a featurization stage which generates base features; an individual prediction stage which uses base features with the base individual deep learning models to generate the outputs also called meta features; and a meta ensemble stage which uses meta features generated by the previous stage to classify the molecule as hERG blocker or hERG non-blocker.

Materials and methods

Data preparation

A dataset consisting of molecular structures labelled as hERG and non-hERG blockers in the form of SMILES strings was obtained from the DeepHIT authors [19] and was curated from five sources, the BindingDB database (3056 hERG blockers, 3039 hERG non-blockers) [24], ChEMBL bioactivity database (4859 hERG blockers, 4751 hERG non-blockers) [25], and literature derived (4355 hERG blockers, 3534 hERG non-blockers) [6], (1545 hERG blockers, 816 hERG non-blockers) [7], (2849 hERG blockers, 1202 hERG non-blockers) [26] and unlike in the DeepHIT procedure, we did not use any in-house data. A total of 30000 molecular structures were obtained and were standardized using RDkit [27] and MolVS [28] as described by Ryu et al. [19]. We further removed inconsistently labeled compounds. Thus we obtained total of 12620 molecules with 6643 labelled as hERG blockers and 5977 as hERG non-blockers to constitute our training set. We evaluated our framework against two external independent test sets, one of which was obtained from the authors of DeepHIT [19], hereafter called test-set I which is positively imbalanced (i.e. more blockers (30) than non-blockers (14)). We also retrieved other two independent test sets, thereafter called test-set II from [29, 30] and test set III from [31] as per the criteria of half maximal inhibitory concentration (IC50) values \(< 10\, \upmu \hbox {M}\) considered to be hERG blockers and (IC50) values \(\ge 10\,\upmu \hbox {M}\) considered to be hERG non-blockers. Test-set II is relatively smaller with 11 blockers and 30 non-blockers whereas Test-set III is relatively larger with 53 blockers and 786 non-blockers. The Tanimoto similarity [19] criteria was also ensured for all molecules in both test and training sets (explained in upcoming section of similarity and chemical diversity). The training set was subdivided into four sets, 70% for training the base models, 10% for validating base models, 10% for training the meta ensemble model and 10% for validating the meta ensemble model. The detailed process of data preparation is given in Additional file 1: S1. It should be noted that all the three independent data sets are imbalanced with higher number of hERG non-blockers. As per our knowledge at the time of conducting this research, these are most of the molecules available in public repositories which are dissimilar to our training data. This also demonstrates the real-world scenario for testing where number of non-blockers is usually more than the number of blockers.

Similarity and chemical diversity

A diverse dataset covering a broad chemical space is a prerequisite for building predictive models [32]. For all SMILES strings in training as well as in both external test sets, we computed the 2048 bit Morgan fingerprints using RDKit [13]. The t-SNE dimensional reduction technique [33] was then used to convert the 2048 dimensional vector into two t-SNE dimensions for each SMILES string. As demonstrated by the chemical space defined by the t-SNE components in Fig. 1, diverse chemical space distributions for classified blockers and non-blockers as well as overlap with the external tests sets was observed. We computed the Tanimoto mean value for each of the datasets separately given in Table 1 and a pairwise Tanimoto similarity shown in Fig. 2 for all four datasets [13]. The Tanimoto mean value shows the mean Tanimoto similarity within each data set whereas pairwise Tanimoto similarity shows similarity between different datasets. The lower the Tanimoto mean value is, the better the diversity of the compounds within the data set. As illustrated in Table 1, the Tanimoto mean value is 0.124 for the training set, 0.126 for the external test-set I, 0.116 for the external test-set II and 0.115 for the external test-set III, which means all the three data sets are diverse. Pairwise Tanimoto similarity as shown in Fig. 2 for external test sets, with respect to the training set is always less than 0.7. The external test-set I is also substantially dissimilar to the external test-set II as the maximum pairwise Tanimoto similarity value is less than 0.5 as shown in Fig. 2c. Similarly, we can see that external test-set III is also dissimilar to the training and other test-set-I and test set-II. We also provide top 3 more similar molecules in training data for each molecules of all three test sets in Additional file 3.

Fig. 1
figure 1

Two dimensional t-SNE components showing the chemical space diversity of training and the three external test sets

Fig. 2
figure 2

Pairwise Tanimoto similarity for each molecule in (a) external test-set I with all molecules in training set. b external test-set II with all molecules in training set. c external test-set I with all molecules in external test-set II. d external test-set III with all molecules in training set. e external test-set III with all molecules in external test-set I. f external test-set III with all molecules in external test-set II

Table 1 Statistical description of data sets

Evaluation criteria

In order to measure the classification performance of CardioTox net, we used the following metrics: Area under curve of receiver operating curve (AUC-ROC), specificity (SPE), sensitivity (SEN), negative predictive value (NPV), positive predictive value (PPV), accuracy (ACC) and Matthew’s correlation coefficient (MCC). The details of these metrics are as follows:

  • Area under curve of receiver operating curve (AUC-ROC) which takes into account all the thresholds. The higher the value of AUC-ROC, the better the model is distinguishing between classes (hERG blockers and hERG non blockers). It can be computed by taking area under the curve for true positive rate (TPR) on the y-axis and false positive rate (FPR) on the x-axis for a given dataset. It should be noted that positive refers to hERG blocker and negative refers to non-hERG blocker. TPR which is also called sensitivity (SEN) describes how good the model is at classifying a molecule as a hERG blocker when the actual outcome is also a hERG blocker. FPR describes how often a hERG blocker class is predicted when the actual outcome is non-hERG blocker.

    $$\begin{aligned} SEN= & {} TPR = \frac{TP}{TP + FN} \end{aligned}$$
    (1)
    $$\begin{aligned} FPR= & {} \frac{FP}{FP + TN} \end{aligned}$$
    (2)

    where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives, SEN = Sensitivity.

  • Specificity (SPE) is the total number of true negatives divided by the sum of the number of true negatives and false positives. Specificity would describe what proportion of the non-hERG blocker class got correctly classified by our model.

    $$\begin{aligned} SPE = \frac{TN}{TN + FP} \end{aligned}$$
    (3)
  • Negative predictive value (NPV) describes the probability of a molecule predicted as non-hERG blocker to be actually as non-hERG blocker.

    $$\begin{aligned} NPV = \frac{TN}{TN + FN} \end{aligned}$$
    (4)
  • Positive predictive value (PPV) describes the probability of a molecule predicted as hERG blocker to be actually as hERG blocker.

    $$\begin{aligned} PPV = \frac{TP}{TP + FP} \end{aligned}$$
    (5)
  • Accuracy (ACC) is the fraction of prediction our model got right. i.e it predicted hERG blocker and non-hERG blocker correctly.

    $$\begin{aligned} ACC = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
    (6)
  • Matthews Correlation Coefficient (MCC) has a range of −1 to 1 where −1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier.

    $$\begin{aligned} MCC = \frac{TP * TN - FP * FN}{\sqrt{(TP + FP)(TP+FN)(TN+FP)(TN+FN)}} \end{aligned}$$
    (7)

Featurization stage

The featurization stage of our framework consists of various types of featurizers which takes SMILES string as an input and produce fixed length base features as shown in Fig. 3a.

Fig. 3
figure 3

a CardioTox framework: End to end flow diagram of all the stages of proposed framework. b Architecture specifications of fully connected neural network for 995 2D and 3D descriptors as base features. c Architecture specifications of graph convolutional neural network for node vector of size 50x65 and adjacency vector of size 50x50 as base features. d Architecture specifications of fully connected neural network for 1024 EFCP and 881 pubchem fingerprints as base features (e) Architecture specifications of 1D convolution neural network for SMILES and fingerprints embedding vectors as base features. f Architecture specifications of meta ensemble fully connected neural network for meta features

Descriptors

A total of 995 high level features such as 2D and 3D physicochemical descriptors (DESC) were computed using Mordred [34], names of which are also given in Additional file 2: S5. These features are numerical in nature and describe the physical and chemical properties of molecules [35]. 2D descriptors represents information related to size, shape, distribution of electrons, octanol-water distribution coefficient (LogP) which is a measure for lipophilicity, nAromAtom which shows number of aromatic atoms, nHeavyAtom which shows number of heavy atoms, nBondsT shows number of triple bonds. 3D descriptors relates to the 3D conformation of the molecules such as moment of inertia along Y axis (MOMIY) [35]. The value of each descriptor was normalized between 0 and 1.

Molecular graph featurizer

Topological information of molecules can be intuitively and concisely expressed via molecular graph features. This intermediate level featurizer computes molecular graph features such as node vectors which represents atoms in the SMILES string and an adjacency matrix which shows the bonds between atoms [17]. In this study, we extracted the same graph features as were extracted for DeepHIT [19], i.e a [50 × 65] node vector and a [50 × 50] adjacency matrix, details of which are also given in Additional file 2: S6. Here 50 refers to the maximum number of atoms and 65 refers to the one hot-encoded feature vector computed from atom descriptors [19].

Molecular fingerprint generator

The third featurizer deals with fingerprints where structural features are represented by either bits in a bit string or counts in a count vector [36, 37]. 1024 extended-connectivity fingerprints with a maximum diameter parameter of 2 (EFCP2) fingerprints and 881 pubchem fingerprints were computed using using the Python package PyBioMed [19, 38]. EFCP are also referred to as circular fingerprints and are specifically designed for structure-activity relationship modeling [39] whereas pubchem fingerprints are mainly designed for similarity neighboring and similarity searching [40].

SMILES vectorizer

We also computed two variants of low level features, SMILES strings embedded vectors (SeV) [41, 42] and fingerprint based embedded vectors (FPeV) [14] which themselves do no directly describe any biological attribute of the molecules, but has proven to have a reasonable predictive power in various quantitative structure-activity relationship (QSAR) tasks. In the SMILES vectorizer, we created a vocabulary based on the valid SMILES tokens (procedure described in Additional file 1: S2). A total of 64 unique tokens were determined based on the training data. The longest SMILES string in the data considered for this study was 97. Each SMILES string was converted into a one-hot encoded vector based on the SMILES vocabulary.

Fingerprints vectorizer

In the fingerprint vectorizer, SMILES string are converted into 1024 bit Morgan (or circular) fingerprints with a radius of 2 via RDKit [13]. As per the previously published technique [14], we extracted fingerprint indices which were marked 1 in the fingerprint generated. Thus we obtained a vector of length 93 which consisted of integers representing presence of specific substructures in a molecule. The procedure for fingerprint embedding vector is described in Fig. 1 of FP2VEC [14].

Individual prediction stage

The individual prediction stage consists of base models which are trained on respective base features from the featurization stage. All of the base models were trained at a learning rate of \(10e^{-4}\) with an Adam optimizer and 100 epochs with a batch size of 32. Selection of parameters, hyper-parameters and network architecture of base models were inspired from the previous published research in this area [8, 14, 15, 19, 41,42,43]. Each of these base models produce an output which is a single probability of a molecule being a hERG blocker. Here we describe each base model in the individual prediction stage also shown in Fig. 3b–e. The Keras deep learning framework and Spektral package was used in developing base models for the individual prediction stages [44, 45].

Fully Connected Neural Network for Descriptors (FCNND)

A fully connected deep neural network with 4 hidden layers was trained and validated on 995 2D and 3D physicochemical descriptors. The input layer consists of 995 nodes as per the number of total physicochemical descriptors and an output layer with 1 unit. All the layers in FCNND are densely connected and receives input from all the units present in the previous layer. The number of units in each hidden layer is decreased gradually and a ReLu activation [46, 47] is applied at the end of each layer. Kernel regularizer and bias regularizer of values 0.01 were used in training [47, 48] to reduce the over-fitting during optimization. Kernel regularizer applies penalties to the Kernel (main units in layer) and bias regularizer applies penalties to the bias units. We also applied a drop-out rate of 0.5 to the middle layers [49].

Graph Convolutional Neural Network for Graph features (GCNN)

A graph convolutional neural network (GCNN) was trained using the graph features as shown in Fig. 3c. GCNN consists of two graph convolution layers [50], one global attention pool layer [51] and a dense layer before the output. Each of the graph convolutional layers were initiated with 64 channels with a Kernel regularization value of 0.01 and a ReLu activation. The number of channels in the global attention pool layer was made equal to the number of units in the following dense layer, i.e 1024.

Fully Connected Neural Network for Fingerprints (FCNNF)

A fully connected neural network was used with fingerprints (FCNNF) as the base feature. Unlike FCNND, FCNNF uses a much smaller number of units in each layer. Except the number of units, other parameters were kept the same as in FCNND. The number of input nodes in the input layer were kept at 1905 to match the sum of 1024 EFCP fingerprints and 881 pubchem fingerprints as shown in part Fig. 3d.

Convolution 1D Neural Network for SMILES and Fingerprint embedding vectors (C1D)

For models where SMILES and fingerprint embedding vectors were used as base features, we used a variant of a Convolution 1D Neural Network (C1D) as base model as shown in Fig. 3e. The only difference was in the number of input-layer nodes which was 97 for SMILES embedding vectors and 93 for fingerprint embedding vectors. Input vectors were converted to a trainable embedding matrix of the size [97 or 93 × 200] which was then fed into a series of three 1D convolution layers. Each of these 1D convolution layers used ReLu activation, 192 filters with a Kernel size of 10, 5 and 3 respectively. Two densely connected layers with the parameters shown in Fig. 3e are also used to before the output layer.

Meta ensemble stage

The outputs of each of the base models in the individual prediction stage were concatenated to produce meta features for the meta ensemble model. The Meta ensemble model is a fully connected neural network (FCNNM) with an input, output and two hidden layers as shown in Fig. 3f. It is trained at a learning rate of \(10e^{-3}\) with an Adam optimizer and 300 epochs with a batch size of 32.

Results and discussion

Our proposed framework employs step-wise training to produce the final classification of molecules as hERG or non-hERG blockers. For this purpose, data was divided into four sets, base training set: 70% for training base models , base validation set: 10% for validating base models, meta training set: 10% for training meta-ensemble model and meta validation set: 10% for validating the meta-ensemble model. In the first step of training, all the base models were trained on the base training set and validated using the base validation set. In the second step, the outputs of the best performing base models for the base validation set were used as meta features to train the meta ensemble model with the meta training set. We used the meta validation set to obtain the best meta ensemble model and also to select which combination of the base models ensembling produces better results. We performed consecutive splitting 10 fold cross validation [52] to obtain results given in the following subsection. For each time, we divided the data into 10 parts. Seven parts were used for base training, one part for base validation, one part for meta training and one part for meta validation.

Validation of base model performance

The 10 fold cross validated results for individual base models of our framework on base validation set are shown in Table 2. Each base model is trained and validated with its own respective base features independently. In the Table 2, DESC refers to high level features such as 2D and 3D descriptors feeding the FCNND, MGF refers to intermediate molecular graph features fed into GCNN, MFP refers to low level molecular fingerprints fed into FCNNF, SeV refers to one of the low level variant i.e, SMILES embedding vectors when used with C1D and FPeV refers to low level variant i.e, fingerprint embedding vectors when used with C1D.

Table 2 10 fold cross validated performance of the base models in individual prediction stage on base valid set using their respective base features

As shown in Table 2, DESC performed better in MCC, ACC and PPV whereas MFP performed better in NPV, SEN and AUC. The possible reason might be the direct biological relevance of these base features (descriptors and fingerprints) to the activity prediction. Interestingly, SeV and FPeV showed better performance than MGF despite no biological relevance of the features used. FPeV and SeV achieved almost similar performance in most the of performance metrics. MGF legs behind in most of the metrics except SEN where it achieved slightly better performance than DESC.

Meta validation performance

The overall goal of this study is to aggregate the chemical information extracted from various base features for cardio-toxicity data set so that the classification performance can be improved over a wide range of metrics. For that purpose, the outputs of the base models are concatenated to produce meta features for the use of a meta ensemble model as shown in Fig. 3a. A separate meta training set and meta validation set is used for training and validating the meta ensemble model. Table 3 demonstrates 10 fold cross validation results for the meta validation set for ensembling all possible unique combinations of base features ranging from 1 to 5. For instance, M1 represents single type of base features used in creating meta features whereas M2, M3, M4 and M5 represents any two, three, four and 5 different types of the base features with no repetitions.

Table 3 10 fold cross validation results for various meta features on meta validation set

It can be seen from Table 3 that meta features in M3 and M4 show overall better performance for most of the metrics. In the M4 meta-feature category, M4-5 achieves the best results of MCC: 0.720, ACC: 0.860, PPV: 0.871 and AUC: 0.930. In the M3 meta-feature category, M3-2 achieves the best results for NPV: 0.855 and SEN: 0.874. M3-5 also achieves similar performance of 0.874 for SEN to that of M3-2. Similarly for AUC, M3-7 achieves a similar performance of 0.930 compared to that of M4-5. For SPE however, none of the base-feature combinations (ranging from M2 to M5) improves the performance over M1-1 which is 0.868. Interestingly for SPE, the individual lower performance of MGF, FPeV and SeV (M1-2: 0.792, M1-4: 0.795 and M1-5: 0.791) is improved substantially with meta features comprised of any of the combinations (M2-3: 0.830, M2-4: 0.833 and M2-10: 0.835). This improvement offers some perspective on potentially better ensembling performance even if the individual performance is relatively lower for MGF, FPeV and SeV.

Effectiveness of meta features

In order to investigate the effectiveness of meta features (M2–M5) as compared to the ones which use only single individual base features (M1), we computed % improvement of each of the meta feature ranging from M2 to M4 over best M1 on the meta validation set as shown in Fig. 4a. An overall improvement can be observed in MCC, NPV, ACC, SEN and AUC. For PPV, more fluctuations across zero axis are observed for various meta features. For SPE, there is overall decrease in performance with relatively bigger fluctuations on the negative side. It can be observed from Fig. 4a and Table 3 that for meta feature M4-5, 4 out of 7 metrics shows improvement as compared to best M1. Thus we select meta feature M4-5 as the final unique combination of base features for our CardioTox net framework for further analysis and final evaluation against external test sets.

Fig. 4
figure 4

a shows the affect of various meta features in terms of % improvement over the base features using an ensemble stage of CardioTox framework on meta valid set. b shows the % difference of CardioTox and DeepHIT from their respective best base models performance for various performances metrics

In Figure 4b, we show the % difference of CardioTox and DeepHIT from their respective best base model performances for various performance metrics. The values in Fig. 4b are retrieved from Table 2 of the DeepHIT publication [19] and Table 3 for CardioTox. As shown in Table 2 of DeepHIT, the best performance is shown by Descriptor-based DNN for all metrics. DeepHIT is optimized for SEN and NPV with a substantial sacrifice of MCC, ACC, PPV and SPE. It improves SEN by 12.48% and NPV by 9.59% with a sacrifice of 4.47% MCC, 2.87% ACC, 10.63% PPV and 18.09% SPE. On the other hand, CardioTox net improves MCC by 5.7%, NPV by 2.34%, ACC by 2.37%, PPV by 1.15% and SEN by 2.52% with a sacrifice of 1.39% in SPE only. With an overall improvement in nearly all the metrics for a relatively little sacrifice of SPE as compared to DeepHIT, CardioTox net performance can be considered more robust.

Comparative landscape using the external independent test sets

We compared CardioTox net results with state of the art methods such as DeepHIT [19], CardPred [8], OCHM Predictor-I and OCHM Predictor-II [11] and Pred-hERG 4.2 [18] on three external test sets given in Table 4. For test set-I and test set-II, CardioTox net achieves improved performance for all metrics except SEN where its performance is the same as achieved by second best method DeepHIT. The achieved performance for MCC is (0.599, 0.452), PPV is (0.893, 0.455) and SPE is (0.786, 0.600) over DeepHIT for test set-I and test set-II respectively. The SEN is 0.833 for test set-I and 0.909 for test set-II which is the same as achieved by DeepHIT. For ACC and NPV, the performance for test set-I and test set- II is (0.810, 0.755), and (0.688, 0.947) which is also better than DeepHIT. OCHM-Predictor I, II achieves better performance for PPV and SPE but lags behind significantly in all other metrics for both test sets. Pred-hERG 4.2 performs reasonably well for SEN in both tests but performs worse in other metrics. Interestingly for test-set II, OCHEM-Predictor I and II performs reasonably well for PPV and SPE with less sacrifice in other metrics as compared to its performance on test set-I. For test set-III which is relatively larger, our method achieves better performance for all metrics as compared to DeepHIT except SEN where it achieves same performance as DeepHIT. For test set-III as well, OCHEM Predictor-I achieves better performance for PPV and SPE only while legging behind significantly in other metrics. For SEN though, Pred-hERG 4.2 achieves the highest value.

Table 4 Comparison of CardioTox with other methods using three external independent test sets. B-ACC refers to balanced accuracy

DeepHIT is specifically designed and trained to obtain better NPV and SEN by using physicochemical descriptors, fingerprints and graph features with three deep learning base models. CardPred used an individual neural network model (out of six other models) with physicochemical descriptors and fingerprints. OCHMI and OCHMII used range of machine learning models trained on various types of high level physicochemical descriptors. Pred-hERG 4.2 used fingerprints and molecular descriptors with support vector machines to classify the molecules for hERG blocking activity. By using a step-wise training strategy with base and meta ensemble models, CardioTox net shows robust performance against a range of accuracy metrics as compared to the state of the art methods on three independent test sets.

We also compared our results with three classical machine learning methods such as random forest [53], support vector machines [54] and gradient boosting algorithm [55] as shown in Table 4. We first converted all SMILES training as well as test data into 995 2D and 3D physicochemical descriptors (DESC) using Mordred [34]. For all of the three classical methods, we used scikit-learn [56] machine learning library with default settings. For the test set-I which has more positive samples, all three classical machine learning performs the worst of all other methods in nearly all metrics. Support vector machines performs randomly for test set-I. Random forest and gradient boosting performs slightly better than a random classifier. For test set-II and III which have more negative samples, classical methods performance is comparable to other deep learning based methods as shown in Table 4. It should be noted that our model assigns a probability to each molecule under test. The value of the probability if greater than or equal to 0.5 declares the molecule to be hERG blocker.

Conclusion

In this study, we introduced a deep learning based framework called CardioTox net for classifying drug-like molecules as hERG blockers and hERG non blockers. Our approach is based on step-wise training of base and meta ensemble deep learning models. In the first step, 5 deep learning base models are trained and validated. Each of these base models use different types of base features ranging from high level to low level descriptors and their variants. In the second step of training, the output of base models is concatenated to form meta features for training and validating the meta ensemble model. We found that high level physicochemical, low level fingerprints, SMILES embedding vectors and fingerprint embedding vectors when used to create meta features for the meta ensemble model, enhance the performance over a wide range of metrics for the cardio toxicity prediction task. We evaluated our framework against various classification metrics using three independent test sets and obtained a robust performance compared to state of the art methods. Our framework is a robust method for classifying small drug-like molecules as hERG blockers and hERG non blockers.