Background

It is not a mystery that the process of drug design and development is very complex and absorbs a huge amount of time and money [1, 2]. Although nowadays it significantly differs from the drug design strategies from the past (the emergence of new medicines used to be rather a result of serendipity and fortunate accidents [3]), it is still a subject to relatively high risk of failure. Nevertheless, the current strategies of searching for new drugs are much more structured and several steps can be distinguished within them, such as target identification, finding the lead structure, its optimization, preclinical studies and 3 phases of clinical tests [4, 5].

Finding a new active compound towards a particular target is just the first step in the long path of its possible transformation into a drug. Meeting the affinity requirements is not sufficient, as a compound needs to possess favourable physicochemical and pharmacokinetic properties as well, and it should not display any toxic effects [6,7,8]. Within the set of considered parameters it is also important to put attention to metabolic stability, because if a compound is transformed in the organism too quickly, it does not have enough time to induce a desired biological response [9].

Metabolic stability is one of the most difficult parameters to be predicted by computational tools due to extreme complexity of processes related to xenobiotic transformations in the living organisms. The main role in xenobiotic metabolism is played by cytochrome P450—a group of haemoprotein enzymes with monooxidase activity. Almost sixty CYP isoforms occur in human organisms; however, it is CYP3A4 that is responsible for metabolism of the majority of drugs [10,11,12].

A high number of processes that contribute to metabolic stability makes the correct prediction of this parameter a challenging task. As a result, publications on in silico tools for evaluating the speed of compound metabolism are scarce. Here, we mention a few examples of such studies. Schwaighofer et al. [13] analyzed compounds examined by the Bayer Schering Pharma in terms of the percentage of compound remaining after incubation with liver microsomes for 30 min. The human, mouse, and rat datasets were used with approximately 1000–2200 datapoints each. The compounds were represented by molecular descriptors generated with Dragon software and both classification and regression probabilistic models were developed with the AUC on the test set ranging from 0.690 to 0.835. Lee et al. [14] used MOE descriptors, E-State descriptors, ADME keys, and ECFP6 fingerprints to prepare Random Forest and Naïve Bayes predictive models for evaluation of compound apparent intrinsic clearance with the most effective method reaching 75% accuracy on the validation set. Bayesian approach was also used by Hu et al. [15] with accuracy of compound assignment to the stable or unstable class ranging from 75 to 78%. Jensen et al. [16] focused on more structurally consistent group of ligands (calcitriol analogues) and developed predictive model based on the Partial Least-Squares (PLS) regression, which was found to be 85% effective in the stable/unstable class assignment. On the other hand, Stratton et al. [17] focused on the antitubercular agents and applied Bayesian models to optimize metabolic stability of one of the thienopyrimidine derivatives. Arylpiperazine core was deeply examined in terms of in silico evaluation of metabolic stability by Ulenberg et al. [18] (Dragon descriptors and Support Vector Machines (SVM) were used) who obtained performance of R2 = 0.844 and MSE = 0.005 on the test set. QSPR models on a diverse compound sets were constructed by Shen et al. [19] with R2 ranging from 0.5 to 0.6 in cross-validation experiments and stable/unstable classification with 85% accuracy on the test set.

In silico evaluation of particular compound property constitutes great support of the drug design campaigns. However, providing explanation of predictive model answers and obtaining guidance on the most advantageous compound modifications is even more helpful. Searching for such structural-activity and structural-property relationships is a subject of Quantitative Structural-Activity Relationship (QSAR) and Quantitative Structural-Property Relationship (QSPR) studies. Interpretation of such models can be performed e.g. via the application of Multiple Linear Regression (MLR) or PLS approaches [20, 21]. Descriptors importance can also be relatively easily derived from tree models [20, 21]. Recently, researchers' attention is also attracted by the deep neural nets (DNNs) [21] and various visualization methods, such as the ‘SAR Matrix’ technique developed by Gupta-Ostermann and Bajorath [22]. The ‘SAR Matrix’ is based on the matched molecular pair (MMP) formalism, which is also widely used for QSAR/QSPR models interpretation [23, 24]. The work of Sasahara et al. [25] is one of the most recent examples of the development of interpretable models for studies on metabolic stability.

In our study, we focus on the ligand-based approach to metabolic stability prediction. We use datasets of compounds for which the half-lifetime (T1/2) was determined in human- and rat-based in vitro experiments. After compound representation by two key-based fingerprints, namely MACCS keys fingerprint (MACCSFP) [26] and Klekota & Roth Fingerprint (KRFP) [27], we develop classification and regression models (separately for human and rat data) with the use of three machine learning (ML) approaches: Naïve Bayes classifiers [28], trees [29,30,31], and SVM [32]. Finally, we use Shapley Additive exPlanations (SHAP) [33] to examine the influence of particular chemical substructures on the model’s outcome. It stays in line with the most recent recommendations for constructing explainable predictive models, as the knowledge they provide can relatively easily be transferred into medicinal chemistry projects and help in compound optimization towards its desired activity or physicochemical and pharmacokinetic profile [34]. SHAP assigns a value, that can be seen as importance, to each feature in the given prediction. These values are calculated for each prediction separately and do not cover a general information about the entire model. High absolute SHAP values indicate high importance, whereas values close to zero indicate low importance of a feature.

The results of the analysis performed with tools developed in the study can be examined in detail using the prepared web service, which is available at https://metstab-shap.matinf.uj.edu.pl/. Moreover, the service enables analysis of new compounds, submitted by the user, in terms of contribution of particular structural features to the outcome of half-lifetime predictions. It returns not only SHAP-based analysis for the submitted compound, but also presents analogous evaluation for the most similar compound from the ChEMBL [35] dataset. Thanks to all the above-mentioned functionalities, the service can be of great help for medicinal chemists when designing new ligands with improved metabolic stability. All datasets and scripts needed to reproduce the study are available at https://github.com/gmum/metstab-shap.

Results

Evaluation of the ML models

We construct separate predictive models for two tasks: classification and regression. In the former case, the compounds are assigned to one of the metabolic stability classes (stable, unstable, and of middle stability) according to their half-lifetime (the T1/2 thresholds used for the assignment to particular stability class are provided in the Methods section), and the prediction power of ML models is evaluated with the Area Under the Receiver Operating Characteristic Curve (AUC) [36]. In the case of regression studies, we assess the prediction correctness with the use of the Root Mean Square Error (RMSE); however, during the hyperparameter optimization we optimize for the Mean Square Error (MSE). Analysis of the dataset division into the training and test set as the possible source of bias in the results is presented in the Appendix 1. The model evaluation is presented in Fig. 1, where the performance on the test set of a single model selected during the hyperparameter optimization is shown.

Fig. 1
figure 1

Global prediction power of the ML algorithms in a classification and b regression studies. The Figure presents global prediction accuracy expressed as AUC for classification studies and RMSE for regression experiments for MACCSFP and KRFP used for compound representation for human and rat data

In general, the predictions of compound half-lifetimes are satisfactory with AUC values over 0.8 and RMSE below 0.4–0.45. These are slightly higher values than AUC reported by Schwaighofer et al. (0.690–0.835), although datasets used there were different and the model performances cannot be directly compared [13]. All class assignments performed on human data are more effective for KRFP with the improvement over MACCSFP ranging from ~ 0.02 for SVM and trees up to 0.09 for Naïve Bayes. Classification efficiency performed on rat data is more consistent for different compound representations with AUC variation of around 1 percentage point. Interestingly, in this case MACCSFP provides slightly more effective predictions than KRFP. When particular algorithms are considered, trees are slightly preferred over SVM (~ 0.01 of AUC), whereas predictions provided by the Naïve Bayes classifiers are worsefor human data up to 0.15 of AUC for MACCSFP. Differences for particular ML algorithms and compound representations are much lower for the assignment to metabolic stability class using rat datamaximum AUC variation is equal to 0.02.

When regression experiments are considered, the KRFP provides better half-lifetime predictions than MACCSFP for 3 out of 4 experimental setupsonly for studies on rat data with the use of trees, the RMSE is higher by 0.01 for KRFP than for MACCSFP. There is ~ 0.02–0.03 RMSE difference between trees and SVMs with the slight preference (lower RMSE) for SVM. SVM-based evaluations are of similar prediction power for human and rat data, whereas for trees, there is ~ 0.03 RMSE difference between the prediction errors obtained for human and rat data.

Regression vs. classification

Besides performing ‘standard’ classification and regression experiments, we also pose an additional research question related to the efficiency of the regression models in comparison to their classification counterparts. To this end, we prepare the following analysis: the outcome of a regression model is used to assign the stability class of a compound, applying the same thresholds as for the classification experiments. Accuracy of such classification is presented in Table 1.

Table 1 Comparison of accuracy of standard classification and class assignment based on the regression output

Analysis of the classification experiments performed via regression-based predictions indicate that depending on the experimental setup, the predictive power of particular method varies to a relatively high extent. For the human dataset, the ‘standard classifiers’ always outperform class assignment based on the regression models, with accuracy difference ranging from ~ 0.045 (for trees/MACCSFP), up to ~ 0.09 (for SVM/KRFP). On the other hand, predicting exact half-lifetime value is more effective basis for class assignment when working on the rat dataset. The accuracy differences are much lower in this case (between ~ 0.01 and 0.02), with an exception of SVM/KRFP with difference of ~ 0.75. The accuracy values obtained in classification experiments for the human dataset are similar to accuracies reported by Lee et al. (75%) [14] and Hu et al. (75–78%) [15], though one must remember that the datasets used in these studies are different from ours and therefore a direct comparison is impossible.

Global analysis of all ChEMBL data

We analyzed the predictions obtained on the ChEMBL data with the use of SHAP values in order to find these substructural features, which have the highest contribution to particular class assignment (Fig. 2) or prediction of exact half-lifetime value (Fig. 3); class 0—unstable compounds, class 1—compounds of middle stability, class 2—stable compounds.

Fig. 2
figure 2

The 20 features which contribute the most to the outcome of classification models for a Naïve Bayes, b SVM, c trees constructed on human dataset with the use of KRFP

Fig. 3
figure 3

The 20 features which contribute the most to the outcome of regression models for a SVM, b trees constructed on human dataset with the use of KRFP

Analysis of Fig. 2 reveals that among the 20 features which are indicated by SHAP values as the most important overall, most features contribute rather to the assignment of a compound to the group of unstable molecules than to the stable onesbars referring to class 0 (unstable compounds, blue) are significantly longer than green bars indicating influence on classifying compound as stable (for SVM and trees). However, we stress that these are averaged tendencies for the whole dataset and that they consider absolute values of SHAP. Observations for individual compounds might be significantly different and the set of highest contributing features can vary to high extent when shifting between particular compounds. Moreover, the high absolute values of SHAP in the case of the unstable class can be caused by two factors: (a) a particular feature makes the compound unstable and therefore it is assigned to this class, (b) a particular feature makes compound stablein such case, the probability of compound assignment to the unstable class is significantly lower resulting in negative SHAP value of high magnitude.

For both Naïve Bayes classifier as well as trees it is visible that the primary amine group has the highest impact on the compound stability. As a matter of fact, the primary amine group is the only feature which is indicated by trees as contributing mostly to compound instability. However, according to the above-mentioned remark, it suggests that this feature is important for unstable class, but because of the nature of the analysis it is unclear whether it increases or decreases the possibility of particular class assignment.

Amines are also indicated as important for evaluation of metabolic stability for regression models, for both SVM and trees. Furthermore, regression models indicate a number of nitrogen- and oxygen-containing moieties as important for prediction of compound half-lifetime (Fig. 3). However, the contribution of particular substructures should be analyzed separately for each compound in order to verify the exact nature of their contribution.

In order to examine to what extent the choice of the ML model influences the features indicated as important in particular experiment, Venn diagrams visualizing overlap between sets of features indicated by SHAP values are prepared and shown in Fig. 4. In each case, 20 most important features are considered.

Fig. 4
figure 4

Overlap of important keys for a classification studies and b regression studies; c) legend for SMARTS visualization. Analysis of the overlap of the most important keys (in the number of 20) indicated by SHAP values for a classification studies and b regression studies; c legend for SMARTS visualization (generated with the use of SMARTS plus (https://smarts.plus/); Venn diagrams generated by http://bioinformatics.psb.ugent.be/webtools/Venn/

When different classifiers are analyzed, there is only one common feature which is indicated by SHAP for all three models: the primary amine group. The lowest overlap between pairs of models occurs for Naïve Bayes and SVM (only one feature), whereas the highest (8 features) for Naïve Bayes and trees. For SVM and trees, the SHAP values indicate 4 common features as the highest contributors to the assignment to particular stability class. Nevertheless, we should remember that for Naïve Bayes the prediction accuracy was significantly lower than for SVM or trees; and therefore, the features indicated by this approach are also less reliable.

Finally, 4 features are common for SVM and trees in the case of regression experiments: the already mentioned primary amine group, alkoxy-substituted phenyl, secondary amine, and ester. This is in line with the intuition on the possible transformations that can occur for compounds containing these chemical moieties.

Case studies

In order to verify the applicability of the developed methodology on particular case, we analyze the output of an example compound (Fig. 5).

Fig. 5
figure 5

Analysis of the metabolic stability prediction for CHEMBL2207577 for human/KRFP/trees predictive model. Analysis of the metabolic stability prediction for CHEMBL2207577 with the use of SHAP values for human/KRFP/trees predictive model with indication of features influencing its assignment to the class of stable compounds; the SMARTS visualization was generated with the use of SMARTS plus (https://smarts.plus/)

The highest contribution to the stability of CHEMBL2207577 is indicated to be the aromatic ring with the chlorine atom attached (feature 3545) and thiophen (feature 1915), the secondary amine (feature 677) lowers the probability of assignment to the stable class. All these features are present in the examined compounds and their metabolic stability indications are already known by chemists and they are in line with the results of the SHAP analysis.

Web service

The results of all experiments can be analyzed in detail with the use of the web service, which can be found at https://metstab-shap.matinf.uj.edu.pl/. In addition, the user can submit their own compound and its metabolic stability will be evaluated with the use of the constructed models and the contribution of particular structural features will be evaluated with the use of the SHAP values (Fig. 6). Moreover, in order to enable manual comparisons, the most similar compound from the ChEMBL set (in terms of the Tanimoto coefficient calculated on Morgan fingerprints) is provided for each submitted compound (if the similarity is above the 0.3 threshold).

Fig. 6
figure 6

Screens of the web service a main page, b submission of custom compound, c stability predictions and SHAP-based analysis for a submitted compound. Screens of the web service for the compound analysis using SHAP values. a main page, b submission of custom compound for evaluation, c stability predictions for a submitted compound and SHAP-based analysis of its structural features

Obtaining such information enables optimization of metabolic stability as the substructures influencing this parameter are detected. Moreover, the comparison of several ML models and compound representations allows to provide a comprehensive overview of the problem.

An example analysis of the output of the presented web service and its application in the compound optimization in terms of its metabolic stability is presented in Fig. 7.

Fig. 7
figure 7

Custom compound analysis with the use of the prepared web service and output application to optimization of compound structure. Custom compound analysis with the use of the prepared web service, together with the application of its output to the optimization of compound structure in terms of its metabolic stability (human KRFP classification model was used); the SMARTS visualization generated with the use of SMARTS plus (https://smarts.plus/)

The analysis of the submitted compound (evaluated in the classification studies as stable) indicates that the highest positive contribution to its metabolic stability has benzaldehyde moiety, and the feature which has a negative contribution to the assignment to the stable class is aliphatic sulphur. The most similar compound from the ChEMBL dataset is CHEMBL2315653, which differs from the submitted compound only by the presence of a fluorine atom. For this compound, the substructure indicated as the one with the highest positive contribution to compound stability is fluorophenyl. Therefore, the proposed structural modifications of the submitted compound involves the addition of the fluorine atom to the phenyl ring and the substitution of sulfone by ketone.

Conclusions

In the study, we focus on an important chemical property considered by medicinal chemistsmetabolic stability. We construct predictive models of both classification and regression type, which can be used for computational assessment of this parameter with the use of the provided on-line tool. Moreover, we use an explainability method called SHAP to develop a methodology for indication of structural contributors, which have the strongest influence on the particular model output. Finally, we prepared a web service, where user can analyze in detail predictions for CHEMBL data, or submit own compounds for metabolic stability evaluation. As an output, not only the result of metabolic stability assessment is returned, but also the SHAP-based analysis of the structural contributions to the provided outcome is given. In addition, a summary of the metabolic stability (together with SHAP analysis) of the most similar compound from the ChEMBL dataset is provided. All this information enables the user to optimize the submitted compound in such a way that its metabolic stability is improved. The web service is available at https://metstab-shap.matinf.uj.edu.pl/.

Methods

Data

We use CHEMBL-derived datasets describing human and rat metabolic stability (database version used: 23). We only use these measurements which are given in hours and refer to half-lifetime (T1/2), and which are described as examined on’Liver’,’Liver microsome’ or’Liver microsomes’. The half-lifetime values are log-scaled due to long tail distribution of the metabolic stability measurements. In case of multiple measurements for a single compound, we use their median value. In total, the human dataset comprises 3578 measurements for 3498 compounds and the rat dataset 1819 measurements for 1795 compounds. The resulting datasets are randomly split into training and test data, with the test set being 10% of the whole data set. The detailed number of measurements and compounds in each subset is listed in Table 2. Finally, the training data is split into five cross-validation folds which are later used to choose the optimal hyperparameters.

Table 2 Number of measurements and compounds in the ChEMBL datasets

In our experiments, we use two compound representations: MACCSFP [26] calculated with the RDKit package [37] and Klekota & Roth FingerPrint (KRFP) [27] calculated using PaDELPy (available at https://github.com/ECRL/PaDELPy)—a python wrapper for PaDEL descriptors [38]. These compound representations are based on the widely known sets of structural keys—MACCS, developed and optimized by MDL for similarity-based comparisons, and KRFP, prepared upon examination of the 24 cell-based phenotypic assays to identify substructures which are preferred for biological activity and which enable differentiation between active and inactive compounds. Complete list of keys is available at https://metstab-shap.matinf.uj.edu.pl/features-description. Data preprocessing is model-specific and is chosen during the hyperparameter search.

For compound similarity evaluation, we use Morgan fingerprint, calculated with the RDKit package with 1024-bit length and other settings set to default.

Tasks

We perform both direct metabolic stability prediction (expressed as half-lifetime) with regression models and classification of molecules into three stability classes (unstable, medium, and stable). The true class for each molecule is determined based on its half-lifetime expressed in hours. We follow the cut-offs from Podlewska et al. [39]:

  •  ≤ 0.6low stability,

  • (0.6 − 2.32 > medium stability,

  •  > 2.32high stability.

Models

In our experiments, we examine Naïve Bayes classifiers, Support Vector Machines (SVMs), and several models based on trees. We use the implementations provided in the scikit-learn package [40]. The optimal hyperparameters for these models and model-specific data preprocessing is determined using five-fold cross-validation and a genetic algorithm implemented in TPOT [41]. The hyperparameter search is run on 5 cores in parallel and we allow it to last for 24 h. To determine the optimal set of hyperparameters, the regression models are evaluated using (negative) mean square error, and the classifiers using one-versus-one area under ROC curve (AUC), which is the average AUC of all possible pairwise combinations of classes. We use the scikit-learn implementation of ROC_AUC score with parameter multiclass set to ‘ovo’.

The hyperparameters accepted by the models and their values considered during hyperparameter optimization are listed in Tables 3, 4, 5, 6, 7, 8, 9. After the optimal hyperparameter configuration is determined, the model is retrained on the whole training set and evaluated on the test set.

Table 3 Hyperparameters accepted by different Naïve Bayes classifiers
Table 4 The values considered for hyperparameters for Naïve Bayes classifiers
Table 5 Hyperparameters accepted by different tree models
Table 6 The values considered for hyperparameters for different tree models
Table 7 Hyperparameters accepted by SVMs with different kernels for classification experiments
Table 8 Hyperparameters accepted by SVMs with different kernels for regression experiments
Table 9 The values considered for hyperparameters for different SVM models

Explainability

We assume that if a model is capable of predicting metabolic stability well, then the features it uses might be relevant in determining the true metabolic stability. In other words, we analyse machine learning models to shed light on the underlying factors that influence metabolic stability. To this end, we use the SHapley Additive exPlanations (SHAP) [33]. SHAP allows to attribute a single value (the so-called SHAP value) for each feature of the input for each prediction. It can be interpreted as a feature importance and reflects the feature’s influence on the prediction. SHAP values are calculated for each prediction separately (as a result, they explain a single prediction, not the entire model) and sum to the difference between the model’s average prediction and its actual prediction. In case of multiple outputs, as is the case with classifiers, each output is explained individually. High positive or negative SHAP values suggest that a feature is important, with positive values indicating that the feature increases the model’s output and negative values indicating the decrease in the model’s output. The values close to zero indicate features of low importance.

The SHAP method originates from the Shapley values from game theory. Its formulation guarantees three important properties to be satisfied: local accuracy, missingness and consistency. A SHAP value for a given feature is calculated by comparing output of the model when the information about the feature is present and when it is hidden. The exact formula requires collecting model’s predictions for all possible subsets of features that do and do not include the feature of interest. Each such term if then weighted by its own coefficient. The SHAP implementation by Lundberg et al. [33], which is used in this work, allows an efficient computation of approximate SHAP values.

In our case, the features correspond to presence or absence of chemical substructures encoded by MACCSFP or KRFP. In all our experiments, we use Kernel Explainer with background data of 25 samples and parameter link set to identity.

The SHAP values can be visualised in multiple ways. In the case of single predictions, it can be useful to exploit the fact that SHAP values reflect how single features influence the change of the model’s prediction from the mean to the actual prediction. To this end, 20 features with the highest mean absolute SHAP value are plotted side by side starting from the actual prediction and the most important feature at the top. The SHAP values of the remaining features are summed and plotted collectively at the bottom of the plot and ending at the model’s average prediction. In case of classification, this process is repeated for each of the model outputs resulting in three separate plotsone for each of the classes.

The SHAP values for multiple predictions can be averaged to discover general tendencies of the model. Initially, we filter out any predictions which are incorrect, because the features used to provide an incorrect answer are of little relevance. In case of classification, the class returned by the model must be equal to the true class for the prediction to be correct. In case of regression, we allow an error smaller or equal to 20% of the true value expressed in hours. Moreover, if both the true and the predicted values are greater than or equal to 7 h and 30 min, we also accept the prediction to be correct. In other words, we use the following condition: \(\widehat{y}\) is correct if and only if (0.8y ≤ \(\widehat{y}\)  ≤ 1.2y) or (y ≥ 7.5 and \(\widehat{y}\) ≥ 7.5), where y is the true half-lifetime expressed in hours, and \(\widehat{y}\) is the predicted value converted to hours. After finding the set of correct predictions, we average their absolute SHAP values to establish which features are on average most important. In case of regression, each row in the figures corresponds to a single feature. We plot 20 most important features with the most important one at the top of the figure. Each dot represents a single correct prediction, its colour the value of the corresponding feature (blueabsence, redpresence), and the position on the x-axis is the SHAP value itself. In case of classification, we group the predictions according to their class and calculate their mean absolute SHAP values for each class separately. The magnitude of the resulting value is indicated in a bar plot. Again, the most important feature is at the top of each figure. This process is repeated for each output of the modelas a result, for each classifier three bar plots are generated.

Hyperparameter details

The hyperparameter details are gathered in Tables 3, 4, 5, 6, 7, 8, 9: Table 3 and Table 4 refer to Naïve Bayes (NB), Table 5 and Table 6 to trees and Table 7, Table 8, and Table 9 to SVM.

Description of the GitHub repository

All scripts are available at https://github.com/gmum/metstab-shap/. In folder ‘models’ there are scripts which can be used to train the models presented in our work and in folder ‘metstab_shap’, the implementation to reproduce the full results, which includes hyperparameter tuning and calculation of SHAP values. We encourage the use of the experiment tracking platform Neptune (https://neptune.ai/) for logging the results, however, it can be easily disabled. Both datasets, the data splits and all configuration files are present in the repository. The code can be run with the use of Conda environment, Docker container or Singularity container. The detailed instructions to run the code are present in the repository.