RD-Metabolizer: an integrated and reaction types extensive approach to predict metabolic sites and metabolites of drug-like molecules
Experimental approaches for determining the metabolic properties of the drug candidates are usually expensive, time-consuming and labor intensive. There is a great deal of interest in developing computational methods to accurately and efficiently predict the metabolic decomposition of drug-like molecules, which can provide decisive support and guidance for experimentalists.
Here, we developed RD-Metabolizer (Reaction Database-based Metabolizer), an integrated, low-false-positive metabolism prediction approach that covers an extensive range of reaction types. RD-Metabolizer first employs detailed reaction SMARTS patterns to encode different metabolic reaction types, with the aim of covering a larger chemical reaction space. A 2D fingerprint similarity calculation model was built to calculate the metabolic probability of each site in a molecule. RDKit was then used to apply the pre-written reaction SMARTS patterns, both to correct the metabolic ranking of each site produced by the 2D fingerprint similarity calculation model and to generate the corresponding metabolite structures, thereby helping to reduce the number of false positive metabolites. Two test sets were used to evaluate the performance of RD-Metabolizer in predicting SOMs and metabolite structures. The results indicated that RD-Metabolizer was better than or at least as good as several widely used SOM prediction methods. In addition, the number of false positive metabolites was markedly reduced compared with MetaPrint2D-React.
Keywords: Sites of metabolism (SOMs); Metabolites; Reaction SMARTS patterns; 2D fingerprint similarity
Abbreviations
SOM: site of metabolism
MIFs: molecular interaction fields
DFT: density functional theory
SVM: support vector machine
LC–MS/MS: liquid chromatography/tandem mass spectrometry
EGFR: epidermal growth factor receptor
It is important to know how drug candidates are metabolized in the body at early stages of the drug discovery process, because both drug safety and efficacy profiles are greatly affected by human metabolism. Drug-like molecules can be either metabolized into their active forms, which then interact with the therapeutic targets, or converted into inactive, excretable metabolites. In addition, metabolic modifications can introduce toxicity, which is one of the major reasons for failure in drug development. Metabolic liability is also related to other critical issues, such as drug–drug interactions, food–drug interactions and drug resistance [2, 3, 4]. It is therefore of great importance to determine the metabolic properties of drug candidates early. However, experimental approaches for determining those properties are usually expensive, time-consuming and labor intensive. Thus, there is a great deal of interest in developing computational methods to accurately and efficiently predict the metabolic decomposition of drug-like molecules [6, 7, 8, 9].
Prediction of SOMs and prediction of metabolite structures are the two main research directions in computer-aided metabolism prediction, and both can provide decisive support and guidance for experimentalists. Of the two, SOM prediction methods usually achieve the higher prediction accuracy. For example, MetaSite, a commercial software package, uses GRID-derived molecular interaction fields (MIFs) of protein and ligand, protein structural information, and molecular orbital calculations to estimate the likelihood of a metabolic reaction at a given atom position, with a success rate of 85% for tagging a known SOM among the top-2 ranked atom positions. Rydberg et al. [12, 13, 14] implemented SMARTCyp as a fast SOM predictor. The predictor contains a reactivity lookup table of pre-calculated density functional theory (DFT) activation energies for many ligand fragments undergoing CYP3A4- or CYP2D6-mediated transformations. SMARTCyp performs a fast reactivity lookup for the query compound and combines it with a topological accessibility descriptor to produce a final SOM ranking. With the top-2 metric, SMARTCyp identified 76% of SOMs over a dataset of 394 compounds. RegioSelectivity (RS)-predictor, developed by Zaretzki et al. [15, 16], employs a set of 392 quantum chemical atom-specific descriptors and 148 topological descriptors, together with a support vector machine (SVM)-like ranking combined with a multiple instance learning method, to determine potential SOMs. Using the top-2 metric, it identified 78% of SOMs over a test set of 394 compounds. MetaPrint2D [17, 18, 19, 20] identifies the reaction center atoms of the substrates recorded in a biotransformation database through the maximum common substructure method. Each substrate atom and each reaction center atom is encoded in a six-level topological fingerprint, so this process yields two fingerprint databases.
A query molecule is first converted into atom fingerprints, and the fingerprint of each atom is then matched against these two fingerprint databases. By comparing fingerprint similarity, the number of hits in each database can be counted, and the metabolic likelihood of each atom in the query molecule is finally derived. About 70–80% of SOMs in the test compounds are correctly predicted among the three highest-scored atom positions. These computational methods give quite impressive results. However, most of them are limited to CYP450-catalyzed reactions, and they predict only labile sites rather than metabolite structures. Moreover, predicting a SOM is not equivalent to identifying the correct biotransformation at that atom position, because it provides no information about which reaction type will take place. These limitations make it difficult to draw any quantitative conclusions about the metabolic liability of a given molecule. These methods are also less suitable for routine use in supporting the experimental identification of metabolites.
Predicting metabolite structures computationally in advance can decisively help medicinal chemists analyze experimentally determined mass spectrometry (MS) data or liquid chromatography/tandem mass spectrometry (LC–MS/MS) data to pinpoint the actual SOMs. However, only very few computational methods for predicting metabolite structures have been developed so far. These approaches are usually grouped into three categories: expert systems, fingerprint-based data mining approaches and combined approaches. Expert systems mainly employ generic metabolic rules derived by experts to predict metabolite structures. Typical examples are META [22, 23, 24], MetabolExpert, Meteor, SyGMa and TIMES. Among the fingerprint-based data mining approaches, MetaPrint2D-React, an extension of MetaPrint2D, is a representative method. It allows users to predict metabolite structures on the basis of generic metabolic reaction rules. As an example of a combined approach, Tarcsay et al. first used the best setup of the expert system MetabolExpert to generate possible metabolites for the query compound, and then employed the docking program Glide as a post-processing filter to reduce the false positive rate. This combined approach achieved a success rate of 69% for identifying the correct metabolites among the three highest-ranked structures. Although these methods have advantages in speed or in correctly generating metabolite structures, several challenges remain. The main drawback of expert systems is combinatorial explosion, because all possible combinations of metabolic rules permitted by the reaction rule sets are considered. The disadvantage of fingerprint-based data mining methods is that the generic metabolic transformation rules are so simple that they cannot describe complex reaction types and cannot cover a larger chemical reaction space.
The method combined with docking is impractical for many applications, due to its time-consuming and structure-dependent features.
The main contribution of this work is a description of Reaction Database-based Metabolizer (RD-Metabolizer), an integrated, low-false-positive approach covering an extensive range of reaction types for predicting the metabolic sites and metabolites of drug-like molecules. To cover a larger chemical reaction space, detailed reaction SMARTS patterns were first employed to describe the simple and complex reactions recorded in the biotransformation databases. A 2D fingerprint similarity calculation model was built to calculate the metabolic probability of each site in a molecule. RDKit, an open-source cheminformatics toolkit, was then used to apply the pre-written reaction SMARTS patterns, both to correct the metabolic ranking of each site in a molecule and to generate the corresponding metabolite structures. In comparison studies, RD-Metabolizer performed slightly better than or at least as well as several widely used SOM prediction methods in terms of SOM prediction accuracy. Compared with another metabolite prediction method, RD-Metabolizer also generated markedly fewer false positive metabolites. A specific metabolism prediction example, AZD9291, further indicated its robustness in SOM identification and metabolite generation, and confirmed its potential for metabolism prediction.
The dataset used in the present study was extracted from the MDL metabolic reaction database and the integrity database, both of which include metabolic transformations of xenobiotic compounds harvested from the literature. The dataset was generated as follows: (1) duplicate reactions were removed, and only single-step, unique reactions were kept to avoid data redundancy; (2) molecules in a reaction had to have complete chemical structures, so reactions in which the reactant or product contained "R" substituents or free radicals were excluded; (3) reactions in which the reactant or product was invalid (i.e. labeled "No Structure") were filtered out; (4) chelation reactions and reactions with ambiguous reaction centers were excluded, because no reaction SMARTS pattern could express them; (5) reactions in which the reactant or product was a single element (i.e. a metallic element) were removed. Finally, 63,620 individual metabolic reactions were retained as the metabolic reaction dataset for further study.
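The curation steps above can be sketched in a few lines of Python. This is a minimal illustration covering only steps (1) to (3), not the authors' pipeline: the `Reaction` record, its field names, and the use of `*` as a stand-in for an "R" substituent are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """Hypothetical reaction record; field names are assumptions."""
    reactant: str      # structure string, or "No Structure" if missing
    product: str
    n_steps: int = 1   # number of reaction steps recorded


def is_usable(rxn):
    """Simplified versions of filters (1)-(3) described in the text."""
    if rxn.n_steps != 1:                      # keep single-step reactions only
        return False
    for structure in (rxn.reactant, rxn.product):
        if not structure or structure == "No Structure":
            return False                      # invalid reactant or product
        if "*" in structure:                  # assumed marker for an "R" group
            return False
    return True


def curate(reactions):
    """Drop unusable reactions and keep one copy of each unique reaction."""
    seen = set()
    kept = []
    for rxn in reactions:
        key = (rxn.reactant, rxn.product)
        if key in seen or not is_usable(rxn):
            continue
        seen.add(key)
        kept.append(rxn)
    return kept
```

Steps (4) and (5) need chemical perception of reaction centers and element counts, so they are left out of this sketch.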
Preparation of test sets
We randomly selected 425 different substrate molecules from the metabolic reaction dataset as the internal test set (test set 1). After removing the metabolic reaction records of these 425 substrate molecules, the remaining records were used to generate the two topological fingerprint databases required by RD-Metabolizer. The external test set (test set 2) compiled by Zaretzki et al. was used for further method validation. Some structures in the external test set were found to be identical to those in our training set and were therefore removed. As a result, the external test set contained 173 compounds. In addition, all test compounds were carefully checked to ensure the correctness of their 2D structures. Wrong structures were corrected by manually searching different databases, such as DrugBank and PubChem.
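The internal hold-out described above splits by substrate rather than by reaction record, so that no reaction of a test substrate leaks into the fingerprint databases. A minimal sketch, assuming each record is a hypothetical `(substrate, product)` tuple and that the seed and function name are chosen only for illustration:

```python
import random


def split_by_substrate(records, n_test, seed=0):
    """Hold out n_test substrates and drop all of their reaction records.

    records: list of (substrate, product) tuples (assumed layout).
    Returns (held-out substrates, remaining training records).
    """
    substrates = sorted({substrate for substrate, _ in records})
    rng = random.Random(seed)                 # fixed seed for reproducibility
    test_substrates = set(rng.sample(substrates, n_test))
    train = [rec for rec in records if rec[0] not in test_substrates]
    return test_substrates, train
```

In the setting of the text, `n_test` would be 425 and the training records would feed the two fingerprint databases.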
Identification of SOMs and generation of reaction SMARTS patterns
Generation of fingerprint databases
Occurrence ratio calculator
After the topological atom fingerprints of the query compound were generated, the fingerprint of each atom in the query compound was matched against the two fingerprint databases. In the present work, we built a 2D fingerprint similarity calculation model to calculate the metabolic occurrence ratio of each atom in the query compound. The model is composed of three similarity operators that compare the fingerprint matrices: the Exact match operator, the Soergel metric operator [43, 44] and the Hamming metric operator. The Exact match operator was applied first, both for speed and to ensure the presence of the core substructures that determine whether two fingerprints are similar. It requires the compared layers of two fingerprint matrices to be exactly the same (the top three layers were used in our method), so fingerprints whose top three layers do not match can be rejected quickly. The Soergel metric operator and the Hamming metric operator were then applied. Finally, the number of similar fingerprints in each database was counted.
In this study, two fingerprints were considered similar if the scoring function d_total ≤ 3.5, where d_total ranges from 0 (identity) to ∞ (maximum diversity). This threshold gave the fewest false negatives for a set of tested fingerprints.
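The three operators and the hit counting can be sketched as follows. This is an illustration under stated assumptions, not the published scoring function: a fingerprint is assumed to be a list of layers, each layer a tuple of atom-type counts; the Hamming distance is taken as the number of positions whose counts differ; and the total distance is assumed to be the unweighted sum of the Soergel and Hamming distances over the layers below the exact-match layers. The threshold is passed in explicitly rather than hard-coded.

```python
def exact_match(fp_a, fp_b, n_layers=3):
    """True if the top n_layers of both fingerprint matrices are identical."""
    return fp_a[:n_layers] == fp_b[:n_layers]


def soergel(u, v):
    """Soergel distance between two count vectors (0 = identical)."""
    denom = sum(max(a, b) for a, b in zip(u, v))
    if denom == 0:
        return 0.0                            # both layers are empty
    return sum(abs(a - b) for a, b in zip(u, v)) / denom


def hamming(u, v):
    """Number of positions at which the two count vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def total_distance(fp_a, fp_b, n_exact=3):
    """Assumed combination: sum both distances over the remaining layers."""
    return sum(
        soergel(u, v) + hamming(u, v)
        for u, v in zip(fp_a[n_exact:], fp_b[n_exact:])
    )


def count_hits(query_fp, database, threshold):
    """Count database fingerprints judged similar to the query fingerprint."""
    hits = 0
    for fp in database:
        if not exact_match(query_fp, fp):     # cheap rejection first
            continue
        if total_distance(query_fp, fp) <= threshold:
            hits += 1
    return hits
```

Running `count_hits` once against the substrate database and once against the reaction center database gives the two hit counts from which an occurrence ratio can be derived.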
In our study, we used the following division rules to distinguish the metabolic possibilities: very unlikely, 0 ≤ p < 0.15; unlikely, 0.15 ≤ p < 0.33; likely, 0.33 ≤ p < 0.66; very likely, 0.66 ≤ p < 1.00.
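As a small worked example of these division rules, the sketch below turns per-atom hit counts into normalized occurrence ratios and labels them. The ratio definition (reaction center hits divided by substrate hits, scaled by the largest ratio in the molecule) is an assumption in the style of MetaPrint2D; atoms with no hits receive 0.0, and p = 1.00 is treated here as "very likely" so that the top-ranked atom always receives a label.

```python
def normalized_ratios(hits):
    """hits: {atom_index: (reaction_center_hits, substrate_hits)} (assumed)."""
    raw = {
        atom: (rc / sub if sub else 0.0)      # unseen environments score 0.0
        for atom, (rc, sub) in hits.items()
    }
    top = max(raw.values(), default=0.0)
    if top == 0:
        return {atom: 0.0 for atom in raw}
    return {atom: ratio / top for atom, ratio in raw.items()}


def classify(p):
    """Map a normalized occurrence ratio to the likelihood classes above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p < 0.15:
        return "very unlikely"
    if p < 0.33:
        return "unlikely"
    if p < 0.66:
        return "likely"
    return "very likely"                      # includes p == 1.0 (assumption)
```

For instance, an atom with 8 reaction center hits out of 10 substrate hits, in a molecule where that is the highest ratio, is normalized to 1.0 and labeled "very likely".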
Results and discussion
To correctly predict the structures of possible metabolites of a query compound, the SOMs must first be identified correctly. In this section, we examine the SOM and metabolite prediction performance of RD-Metabolizer, which benefits from combining the 2D similarity calculation model with the pre-written reaction SMARTS patterns.
Prediction of metabolic sites
There are two main ways to evaluate SOM prediction performance: qualitative analysis and quantitative analysis [12, 16, 17, 46, 47]. Qualitative analysis relies mainly on visual inspection, in which the predictions of a method are compared with the known metabolic sites of the molecules. Quantitative analysis refers to the percentage of molecules for which at least one of the top k (usually k = 1–3) ranked sites is an experimentally observed SOM. However, this index depends on the size of the molecules and on the number of metabolic sites, which biases the assessment. Prediction of SOMs can also be treated as a classification problem, in which each site in a molecule either is or is not a metabolic site. To overcome the bias of the top-k metric, an overall measure, the area under the curve (AUC) of the receiver operating characteristic (ROC), has therefore been proposed for assessing SOM prediction. This measure was also applied in our study.
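Both measures are straightforward to compute from per-atom scores. The sketch below is illustrative only: the data layout (a dict of atom index to score plus a set of observed SOM atoms) and the function names are assumptions, and the AUC is computed per molecule with the standard pairwise formulation, counting ties as one half.

```python
def top_k_hit(scores, true_soms, k):
    """True if any of the k highest-scored atoms is an observed SOM."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return any(atom in true_soms for atom in ranked[:k])


def top_k_rate(molecules, k):
    """Fraction of molecules with at least one observed SOM in the top k.

    molecules: list of (scores, true_soms) pairs.
    """
    hits = sum(top_k_hit(scores, soms, k) for scores, soms in molecules)
    return hits / len(molecules)


def roc_auc(scores, true_soms):
    """Probability that a SOM atom outscores a non-SOM atom (ties count 0.5)."""
    pos = [s for atom, s in scores.items() if atom in true_soms]
    neg = [s for atom, s in scores.items() if atom not in true_soms]
    if not pos or not neg:
        raise ValueError("need at least one SOM and one non-SOM atom")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Unlike the top-k rate, the AUC uses the full ranking of every atom, which is why it is less sensitive to molecule size and to the number of metabolic sites.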
For test set 2, the top-k (k = 1–3) prediction rates of RD-Metabolizer are higher than those of MetaPrint2D and RS-predictor (Fig. 3b). Although the top-1 prediction rate of RD-Metabolizer for test set 2 is lower than that of SMARTCyp, its top-2 and top-3 prediction rates are comparable with those of SMARTCyp. Three factors may explain why the gap between RD-Metabolizer and SMARTCyp appears in the top-1 rate but not in the top-2 and top-3 rates. First, the two methods define SOMs differently: RD-Metabolizer is reaction SMARTS pattern-based, whereas SMARTCyp is mechanism-based. For example, in N-/O-dealkylation, RD-Metabolizer ranks the heteroatom higher than the carbon atom, while SMARTCyp takes the carbon atom connected to the heteroatom as the reaction center. Second, RD-Metabolizer is a fingerprint similarity-based method, so it cannot make predictions for novel atomic sites whose topological fingerprints do not exist in the two databases we built. Third, on examination, the compounds in test set 2 are mainly involved in phase I metabolism, whereas the two fingerprint databases we built for RD-Metabolizer contain fingerprints of both phase I and phase II metabolic sites. Some polar sites of the compounds in test set 2 may therefore affect the final metabolic site rankings.
Prediction of structures of metabolites
Because RD-Metabolizer was developed with the aim of decreasing the number of false positive metabolites, we also counted the total number of false positive metabolites over all molecules in the test set, considering metabolites whose corresponding SOMs ranked in the top-k (k = 1, 3) positions.
Table: Prediction results of the metabolites for test set 1
More importantly, the number of false positive metabolites generated by RD-Metabolizer was far lower than the number generated by MetaPrint2D-React, indicating that RD-Metabolizer achieved the purpose for which it was developed (Table 2). Several factors were responsible for the false positive metabolites that remained. RD-Metabolizer is a fingerprint-based data mining approach, so combinatorial explosion may still occur for some reactions. For example, molecules containing a phenolic hydroxyl group are cleared from the body by conjugation with one or more endogenous cofactors, such as glucuronic acid, sulfate, amino acids, acetyl coenzyme A and glutathione. RD-Metabolizer was insensitive to the different chemical environments of phenolic hydroxyl groups. It applied all conjugation reactions of phenolic hydroxyl groups in the databases to the query compound, which resulted in many unexpected metabolites. It is therefore extremely important to further refine and split this category of metabolic reactions with reaction SMARTS patterns in order to decrease the number of false positive metabolites. In addition, incorrectly predicted SOMs also led to unexpected metabolites. The accuracy of our method was largely influenced by the diversity of the fingerprint databases we built. If the query molecule had novel atomic fingerprint environments that were in fact the reaction centers, RD-Metabolizer assigned these atoms a normalized occurrence ratio of 0.0. Other atoms in the molecule then had a higher (non-zero) normalized occurrence ratio and were top-ranked, even when their likelihood of being a metabolic site was very low. As a consequence, some false positive metabolites were generated.
The influence of the number of fingerprint layers on prediction results
Compared with the 2D fingerprint similarity model built in SPORCalc (the former version of MetaPrint2D), our method introduces an exact match operator into the fingerprint similarity model. The exact match operator requires the corresponding top three rows of two fingerprint matrices to be exactly the same. The site where a metabolic reaction occurs is usually affected by its surrounding environment, so the exact match operator was introduced mainly to ensure that the compared reaction centers share a small, identical surrounding environment. In addition, applying the exact match operator to the top three layers of the fingerprint matrices is consistent with the way the reaction SMARTS patterns are written for the fingerprint environments of the reaction centers, which improves computational efficiency.
Influence of molecular size
Significance of detailed reaction SMARTS pattern
Table: Examples of detailed reaction SMARTS patterns for complex ring reactions
This work described RD-Metabolizer, an integrated, low-false-positive approach covering an extensive range of reaction types for predicting the metabolic sites and metabolites of drug-like molecules. Detailed reaction SMARTS patterns were first employed to encode different metabolic reaction types, with the aim of covering a larger chemical reaction space. RDKit was used to apply the pre-written reaction SMARTS patterns, both to correct the metabolic ranking of each site in a molecule generated by the 2D fingerprint similarity calculation model and to generate the corresponding metabolite structures. These two steps are critical, because together they make the approach integrated and keep the number of false positives low. Comparison with other widely used methods showed that RD-Metabolizer has better or comparable performance in predicting SOMs and produces fewer false positive metabolites. In addition, a specific example, the mutant-selective EGFR inhibitor AZD9291, was presented to further illustrate the prediction accuracy and efficiency of RD-Metabolizer. In summary, RD-Metabolizer can serve as a useful toolkit for the early assessment of the metabolic properties of lead compounds and drug candidates at the preclinical stage of drug discovery.
JM, SL developed the method and drafted the manuscript. JM, SL and XL interpreted data and performed the evaluation. MZ and HL designed research and approved the final manuscript. All authors read and approved the final manuscript.
This work was supported by the National Natural Science Foundation of China (Grant 81230090), the National Key Research and Development Program (Grant 2016YFA0502304), and Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) under Grant No. U1501501. Shiliang Li is supported by China Postdoctoral Science Foundation (Grant 2016M600290).
The authors declare that they have no competing interests.
- 7. Afzelius L, Arnby CH, Broo A, Carlsson L, Isaksson C, Jurva U, Kjellander B, Kolmodin K, Nilsson K, Raubacher F, Weidolf L (2007) State-of-the-art tools for computational site of metabolism predictions: comparative analysis, mechanistical insights, and future applications. Drug Metab Rev 39:61–86
- 17. Adams SE (2010) Molecular similarity and xenobiotic metabolism. Ph.D. thesis, University of Cambridge, Cambridge, UK
- 20. MetaPrint2D version 1.0 (2010) Unilever Centre for Molecular Science Informatics, University of Cambridge, Cambridge, UK
- 21. Hao CC, Campbell S, Stranz D, McSweeney N (2004) Identification of in vitro metabolites of indinavir using automated LC/MS/MS acquisition, in-silico prediction and structure-based data analysis. In: Proceedings of the 52nd ASMS conference 2004, Nashville (USA)
- 25. Darvas F (1987) In MetabolExpert: an expert system for predicting metabolism of substances. Kaiser KLE, D Reidel Publishing Co., Dordrecht, Holland, pp 71–81
- 30. Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, Repasky MP, Knoll EH, Shelley M, Perry JK, Shaw DE, Francis P, Shenkin PS (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47:1739–1749
- 31. Landrum G. RDKit: open-source cheminformatics. http://www.rdkit.org. Accessed 2 Sep 2014
- 32. Finlay MRV, Anderton M, Ashton S, Ballard P, Bethel PA, Box MR, Bradbury RH, Brown SJ, Butterworth S, Campbell A (2014) Discovery of a potent and selective EGFR inhibitor (AZD9291) of both sensitizing and T790M resistance mutations that spares the wild type form of the receptor. J Med Chem 57:8249–8267
- 33. Accelrys Metabolite Database version 2011.2 (2011) Accelrys Inc., San Diego, CA
- 36. Yanli W, Jewen X, Tugba OS, Jian Z, Jiyao W, Stephen HB (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37:623–633
- 38. Daylight Chemical Information Systems Inc (2006) http://www.daylight.com/dayhtml/doc/theory/index.html. Accessed 31 Jan 2015
- 41. SYBYL Molecular Modeling Software. Tripos Associates Inc., St Louis, MO, USA
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.