A ‘rule of 0.5’ for the metabolite-likeness of approved pharmaceutical drugs
We exploit the recent availability of a community reconstruction of the human metabolic network (‘Recon2’) to study how close in structural terms are marketed drugs to the nearest known metabolite(s) that Recon2 contains. While other encodings using different kinds of chemical fingerprints give greater differences, we find using the 166 Public MDL Molecular Access (MACCS) keys that 90 % of marketed drugs have a Tanimoto similarity of more than 0.5 to the (structurally) ‘nearest’ human metabolite. This suggests a ‘rule of 0.5’ mnemonic for assessing the metabolite-like properties that characterise successful, marketed drugs. Multiobjective clustering leads to a similar conclusion, while artificial (synthetic) structures are seen to be less human-metabolite-like. This ‘rule of 0.5’ may have considerable predictive value in chemical biology and drug discovery, and may represent a powerful filter for decision making processes.
Keywords: Genome-wide metabolic reconstruction; Recon 2; Cheminformatics; KNIME; Metabolite-likeness; Drug-likeness
The declining productivity of the drug discovery process is well known (e.g. Empfield and Leeson 2010; Hay et al. 2014; Kell 2013; Kola 2008; Kola and Landis 2004; Rafols et al. 2014; van der Greef and McBurney 2005). Thus, many groups have sought to assess in silico those structural or biophysical properties of successful drugs that might be used as filters to enrich the contents of drug discovery libraries with molecules that share those properties. This has therefore led to concepts such as “drug-likeness” (e.g. Empfield and Leeson 2010; Hay et al. 2014; Kell 2013; Kola 2008; Kola and Landis 2004; van der Greef and McBurney 2005), “lead-likeness” (Gozalbes and Pineda-Lucena 2011; Holdgate 2007; Oprea et al. 2007, 2001; Wunberg et al. 2006), and “ligand efficiency” (Hopkins et al. 2014) by which the potentially desirable properties of such molecules have been assessed.
We recognise that any molecule bioactive in human cells (whether as a drug or for purposes of chemical genomics) must cross at least one membrane, that nutrients necessarily do so, that natural products remain a major source of successful (marketed) pharmaceutical drugs (Gozalbes and Pineda-Lucena 2011; Holdgate 2007; Oprea et al. 2007, 2001; van Deursen et al. 2011; Wunberg et al. 2006), and that successful drugs require or at least use membrane transporters (Dobson et al. 2009; Dobson and Kell 2008; Giacomini and Huang 2013; Giacomini et al. 2010; Kell 2013; Kell and Dobson 2009; Kell et al. 2013, 2011; Kell and Goodacre 2014; Lanthaler et al. 2011) that normally are used for the transport of intermediary metabolites (Herrgård et al. 2008; Swainston et al. 2013; Thiele et al. 2013). Given the natural role for these transporters as transporters of intermediary metabolites, we and others have thus suggested (hypothesised) that successful drugs are in fact much more like metabolites (we use this term to mean the natural intermediary metabolites of human metabolism, and do not consider metabolites of the drugs) than are the typical structures found in drug discovery libraries (e.g. Chen et al. 2012; Dobson et al. 2009; Feher and Schmidt 2003; Gupta and Aires-de-Sousa 2007; Hamdalla et al. 2013; Karakoc et al. 2006; Khanna and Ranganathan 2009, 2011; Peironcely et al. 2011; Walters 2012; Zhang et al. 2011), and following the principle of molecular similarity (e.g. Bender and Glen 2004; Eckert and Bajorath 2007; Gasteiger 2003; Maldonado et al. 2006; Oprea 2004; Sheridan et al. 2004) that “metabolite-likeness” is therefore a useful criterion for the design of successful drugs (Dobson et al. 2009). 
At one level, this may not be seen as surprising given the fact that pharmaceutical drugs typically bind to proteins at sites to which endogenous metabolites normally bind, but the recognition of the importance of metabolite-likeness in drug discovery and chemical genomics remains less than complete.
While a variety of metabolite (pathway) databases exist (Ooi et al. 2010) [e.g. ChEBI (de Matos et al. 2012; Degtyarenko et al. 2009; Hastings et al. 2013), HMDB (Wishart et al. 2013), KEGG (Kanehisa et al. 2012, 2014), MetaCyc (Altman et al. 2013; Caspi et al. 2014; Karp and Caspi 2011) and MetaboLights (Haug et al. 2013)], the recent availability of a highly curated consensus map (Recon2) of the human metabolic network (and thus of intermediary metabolites) (Swainston et al. 2013; Thiele et al. 2013) now provides the most suitable starting point for the comparison of drugs that have been approved/marketed [available from DrugBank (Knox et al. 2011; Law et al. 2014)] and metabolites that are known to be part of the human metabolic network. We choose Recon2 over, say, HMDB since the measurable presence of a molecule in a human sample (e.g. Dunn et al. 2014) does not exclude that it has a nutritional, xenobiotic or gut microbial origin, and HMDB does contain many ‘metabolites’ that are not in fact produced via pathways containing proteins encoded by the human genome. Peironcely et al. (2011) noted, for instance, that the ‘metabolite’ debrisoquine was classified in their scheme as a non-metabolite (and it is in fact a marketed drug).
Thus the primary purpose of this work (in contrast to our earlier work (Dobson et al. 2009) that included multiple metabolite databases that were not constrained as here), is to use the availability of Recon2 to assess precisely how ‘metabolite-like’ known drugs are, partly as an aid to developing metrics for determining whether drugs are likely to be substrates for relevant transporters and thus whether they are likely to be bioactive. The availability of Recon2 also allows us to reason sensibly about the nature and extent of metabolite space and how it differs from the kinds of molecules typically found in drug discovery libraries.
2.1 Construction of datasets
The list of FDA-approved small molecule drugs was downloaded from DrugBank 3.0 (http://www.drugbank.ca/downloads) in November 2013 as an SDF file and consists of 1491 molecules. This is significantly smaller than the fuller list (7330 ‘drugs’ via Drugbank and KEGGDrug) used previously (Dobson et al. 2009). The list of intermediary metabolites was extracted from the latest version of the Recon2 human metabolic network (Thiele et al. 2013). A further manual curation removed from the ‘drugs’ list (i) ‘drugs’ (mainly nutritional supplements) that are also intermediary metabolites produced by enzymes encoded by the genome and thus part of Recon 2 (though adrenaline was treated as a drug), and (ii) those ‘metabolites’ listed in Recon2 that are xenobiotic in nature or simply metals or salts. However, vitamins and essential amino acids and fatty acids, while not encoded by the human genome, were retained as ‘metabolites’ as they are both necessary for human metabolism and form part of the formal human metabolic network. The resultant data are in Supplementary information S3, and consist of 1113 ‘metabolites’ [cf. 5333 ‘metabolites’ previously (Dobson et al. 2009)] and 1381 ‘drugs’. In addition, data on antimalarial compounds were downloaded from the databases at the EBI (https://www.ebi.ac.uk/chemblntd).
For the cheminformatics analyses we used the KoNstanz Information MinEr (KNIME, www.knime.org) (Beisken et al. 2013; Berthold et al. 2007; Mazanetz et al. 2012; Meinl et al. 2012; Stöter et al. 2013; Warr 2012). KNIME is a workflow environment somewhat similar to Taverna [with which we have previous experience in systems biology analyses (Li et al. 2008a, b)], but which is slightly more focussed on cheminformatics. The workflows we used here included nodes that made use of libAnnotationSBML (Swainston and Mendes 2009), the Chemistry Development Kit (Beisken et al. 2013; Steinbeck et al. 2003) and the RDKit (Riniker and Landrum 2013a, b; Saubern et al. 2011) (www.rdkit.org/). We also used the software MOCK (Handl and Knowles 2007) for multiobjective clustering.
3.1 Comparison of Tanimoto distances between drugs and natural metabolites
Our first task was to assess the average chemical (structure) distances between molecules according to a suitable metric. Many molecular descriptors exist for encoding molecules in a manner that allows this (e.g. Bender 2010; Duan et al. 2010; Koutsoukas et al. 2013; Sastry et al. 2010; Sheridan and Kearsley 2002; Todeschini and Consonni 2000; Wang and Bajorath 2010), most commonly referred to as fingerprints (e.g. Faulon and Bender 2010; Flower 1998) and sometimes with rather different properties and outcomes when matched against structures or biological activities (e.g. Dhanda et al. 2013; Medina-Franco and Maggiora 2014). Thus, and while some experience shows that they are not greatly different from each other when simply comparing chemical or structural similarity (Dobson et al. 2009; Riniker and Landrum 2013a), which is the focus of the present paper, we looked at a number of methods for producing molecular fingerprints. Probably most common are fingerprints derived from structural keys such as the 166 Public MDL (Molecular ACCess System) MACCS keys (Durant et al. 2002) based on a predefined dictionary of 166 substructures [that contain most of the important features of a larger 960-key set (McGregor and Pallai 1997)] and hashed to give 1,024 bits.
Given the molecular fingerprint method chosen, there is a more general acceptance of the metrics for the similarity of molecules whose (sub)structures are so encoded; although it has a size-dependence (that does not matter for this analysis), the Tanimoto distance, that effectively encodes the numbers of matching and non-matching substructures, is both easy to calculate and pre-eminent (Maggiora et al. 2014; Willett 2006).
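To make the metric concrete, the following is a minimal sketch in Python of the Tanimoto similarity between two bit-string fingerprints, each represented as the set of its 'on' bit positions. The function names and the toy fingerprints are hypothetical, and the distance is taken here as one minus the similarity, a common convention that is assumed rather than stated in the text.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices.

    Computed as |A intersect B| / |A union B|: shared substructure bits
    divided by all bits set in either molecule.
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: returning 0.0 is a convention assumed here.
        return 0.0
    return len(a & b) / union


def tanimoto_distance(bits_a, bits_b):
    """Distance taken as 1 - similarity (an assumed convention)."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


# Hypothetical toy fingerprints, not real MACCS keys.
fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 8, 9, 12}

# 3 shared bits out of 6 distinct bits.
print(tanimoto_similarity(fp_x, fp_y))
print(tanimoto_distance(fp_x, fp_y))
```

With real data the sets would hold the positions of the MACCS keys (or other fingerprint bits) that are set for each molecule.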
We recognise that some 20 % of recent new chemical entities are prodrugs (15 % in the top 100 drugs) (Huttunen et al. 2011), and that some of these are converted non-enzymically to the active substances; however, these normally do not differ greatly in structural terms from the active substance in the marketed entities, so for convenience we shall use the latter. In contrast to Peironcely et al. (2011), who used supervised learning methods such as random forests [which are very powerful (Knight et al. 2009)] to predict whether a substance was or was not a metabolite, we are here interested only in the structural similarities between candidate molecules and Recon2 metabolites, and we confine ourselves strictly to unsupervised methods of analysis.
A number of different fingerprints were used to determine if the extent of closeness of a drug to its nearest metabolite depended greatly on the fingerprint used. The various fingerprints used (http://www.rdkit.org/RDKit_Docs.current.pdf) were provided in the RDKit module (Riniker and Landrum 2013a) (https://code.google.com/p/rdkit/wiki/FingerprintsInTheRDKit) of KNIME (http://tech.knime.org/community/rdkit), and as stated by Riniker and Landrum (2013b) were atom pairs (AP), a feature-based circular fingerprint with radius 2 as bit vector (FeatMorgan2), and a circular fingerprint with radius 2 as bit vector (Morgan2). Morgan2 is the RDKit implementation of the familiar ECFP4, and FeatMorgan2 is equivalent to FCFP4 (Landrum et al. 2011). The features used by the RDKit for FeatMorgan2 consist of various donors, acceptors, aromatic atoms, halogens, basic and acidic atoms. We also used a representation (referred to in KNIME and here as ‘RDKit’) that is said to be a ‘Daylight-like’ topological fingerprint based on hashing molecular subgraphs. Most recently, RDKit has added some extra fingerprints, and for completeness we included these too. Thus, ‘layered’ is an experimental substructure fingerprint using hashed molecular subgraphs, while ‘torsion’ is said to be the bit vector topological-torsion fingerprint for a molecule. As indicated above, all of the data are tabulated in Fig S3.
Table: Summary of the most frequently represented ‘closest metabolite’ to FDA-approved drugs, the number of times they appear, and the number of metabolites that are closest to a drug at least once
Column headings: Most common ‘closest metabolite’; Total number of different metabolites that are ‘closest’ to a marketed drug at least once
Entries under ‘Most common closest metabolite’: Linoleic coenzyme A; Vaccenyl coenzyme A
In a similar vein, the different encodings produce quite different assessments of the number of metabolites to which each drug displays a Tanimoto similarity (TS) exceeding 0.5 (Fig. 2c), with (unsurprisingly, given the data in Fig. 2a) the MACCS, RDKit and Layered encodings showing the greatest tendency towards ‘metabolite-likeness’. Based on MACCS, 50 % of marketed drugs have at least 31 metabolites with a TS of 0.5 or more. The ‘winner’ (i.e. the drug with the most metabolites to which it bears a TS greater than or equal to 0.5) is arbekacin, with 364; the relevant data, plus a few named drugs, are given in Fig. 2d. Although it is not necessarily a surprising finding, it is worth noting that these ‘highly metabolite-like’ drugs are natural products or molecules derived therefrom [see also (Kell 2013; Newman and Cragg 2012)]. For the drugs listed by Bickerton et al. (2012), the average greatest TS to a metabolite is 0.547 for the five most drug-like drugs, 0.683 for the five least drug-like drugs, 0.496 for the five most drug-like Ro5 failures and 0.557 for the five least drug-like Ro5 passes (the last figure excluding tegaserod, which is not present in our list).
By contrast, the substance with the lowest nearest-metabolite Tanimoto similarity (NMTS) (perflutren, 0.125) is in fact an injectable contrast agent of lipid microspheres marketed precisely because it does not enter cells, while the next three lowest (NMTS ≤ 0.2) are halothane (an inhalational narcotic), lindane (a topical chlorinated insecticide) and desflurane (a polyfluorinated inhalational anaesthetic), consistent with the fact that virtually no natural human metabolites are halogenated. Ten of the 14 least metabolite-like drugs contain at least two halogens (Fig. 2e).
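The drug-centred quantities discussed above (the similarity of each drug to its nearest metabolite, and the number of metabolites at or above a TS of 0.5) can be sketched as follows. This is an illustrative pure-Python outline, not the KNIME workflow used in the paper; the compound names and fingerprints are hypothetical toy data, and fingerprints are again represented as sets of 'on' bits.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of 'on' bits (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_metabolite(drug_fp, metabolite_fps):
    """Return (name, similarity) of the metabolite most similar to the drug."""
    return max(
        ((name, tanimoto(drug_fp, fp)) for name, fp in metabolite_fps.items()),
        key=lambda pair: pair[1],
    )


def count_similar(drug_fp, metabolite_fps, threshold=0.5):
    """Number of metabolites whose similarity to the drug is >= threshold."""
    return sum(1 for fp in metabolite_fps.values()
               if tanimoto(drug_fp, fp) >= threshold)


# Hypothetical toy fingerprints standing in for Recon2 metabolites and drugs.
metabolites = {"met_A": {1, 2, 3, 4}, "met_B": {3, 4, 5, 6}, "met_C": {10, 11}}
drugs = {"drug_X": {1, 2, 3, 5}, "drug_Y": {20, 21, 22}, "drug_Z": {3, 4, 5, 6, 7}}

nearest = {d: nearest_metabolite(fp, metabolites) for d, fp in drugs.items()}
counts = {d: count_similar(fp, metabolites) for d, fp in drugs.items()}

# Fraction of drugs whose nearest-metabolite similarity exceeds 0.5.
fraction_above = sum(1 for _, sim in nearest.values() if sim > 0.5) / len(drugs)

for d in drugs:
    print(d, nearest[d], counts[d])
print(fraction_above)
```

A drug such as the toy `drug_Y`, which shares no bits with any metabolite, ends up with a nearest-metabolite similarity of zero, analogous in spirit to the heavily halogenated outliers described in the text.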
In a similar vein, it is possible to enquire as to which metabolites have the most or fewest marketed drugs closely associated with them in terms of Tanimoto similarity, the latter in particular as a possible indication of areas of chemical space that might be deemed to be relatively underexplored. The metabolites with the very lowest TS to drugs are small and uninteresting (ammonia, water, etc.), so Fig. 2f illustrates those metabolites that are least similar to numbers of drugs between 900 and 1,000, at the same time illustrating the nonlinearity of drug and metabolite spaces by encoding with colours those metabolites that nonetheless have 1–5 drugs with a TS greater than or equal to 0.9 (glycerol is marked and has one, viz. mannitol). One might consider the sparsely populated areas of ‘metabolite-likeness space’ to be ones worth pursuing in drug discovery.
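The metabolite-centred view can be outlined in the same way: for each metabolite, count the drugs that lie at or above a chosen similarity threshold, then rank the metabolites by that count to highlight sparsely populated regions. The sketch below assumes hypothetical toy fingerprints and an illustrative threshold of 0.7; neither is taken from the paper.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of 'on' bits (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def drugs_per_metabolite(metabolite_fps, drug_fps, threshold):
    """For each metabolite, count the drugs with similarity >= threshold."""
    return {
        m_name: sum(1 for d_fp in drug_fps.values()
                    if tanimoto(m_fp, d_fp) >= threshold)
        for m_name, m_fp in metabolite_fps.items()
    }


# Hypothetical toy data.
metabolites = {"met_A": {1, 2, 3}, "met_B": {7, 8, 9, 10}}
drugs = {"drug_P": {1, 2, 3}, "drug_Q": {1, 2, 3, 4}, "drug_R": {7, 8}}

close_counts = drugs_per_metabolite(metabolites, drugs, threshold=0.7)

# Metabolites ordered from fewest to most closely associated drugs.
sparsest_first = sorted(close_counts, key=close_counts.get)
print(close_counts)
print(sparsest_first)
```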
3.2 Multiobjective clustering of drugs and metabolites
3.3 The drug-likeness of synthetic ‘druglike’ molecules and ‘fragments’ and of natural products
4 Discussion and conclusions
While both drug and drug target spaces are evidently very heterogeneous (e.g. Adams et al. 2009; Hopkins et al. 2014; Medina-Franco and Maggiora 2014; Paolini et al. 2006), and that is reflected in the analyses presented here, it is highly desirable to be able to find properties that are well represented in marketed (and hence effective and successful) drugs. Given the complexity of drug space, finding a simple mnemonic or rule that has utility is to be welcomed. Indeed, the original ‘rule of 5’ paper states (Lipinski et al. 1997) “This analysis led to a simple mnemonic which we called the ‘rule of 5’ because the cutoffs for each of the four parameters were all close to 5 or a multiple of 5….The ‘rule of 5’ states that: poor absorption or permeation are more likely when: there are more than 5 H-bond donors (expressed as the sum of OHs and NHs); The MWT is over 500; the Log P is over 5 (or M Log P is over 4.15); there are more than 10 H-bond acceptors (expressed as the sum of Ns and Os); compound classes that are substrates for biological transporters are exceptions to the rule.” This famous ‘rule of 5’ (Lipinski et al. 1997) has been highly influential in this regard, but only about 50 % of orally administered new chemical entities actually obey it (Overington et al. 2006; Zhang and Wilkinson 2007) (and see Hopkins et al. 2014); indeed half of recent ‘new chemical entities’ are natural products (Newman and Cragg 2012), that do not obey the Ro5 either. The (also very effective) ‘rule of three’ (Congreve et al. 2003) applies solely to leads and not drugs. While improving drug effectiveness is probably best addressed using combinations of molecules (e.g. Small et al. 2011), we have shown that when encoded using the public MDL MACCS keys, more than 90 % of individual marketed drugs obey a ‘rule of 0.5’ mnemonic, elaborated here, to the effect that a successful drug is likely to lie within a Tanimoto distance of 0.5 of a known human metabolite. 
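For comparison, the four ‘rule of 5’ cutoffs quoted above can be written as a simple check. The sketch below only flags which of the quoted cutoffs a compound exceeds; it does not decide how many flags constitute a failure, and the property values are hypothetical inputs that would in practice come from a cheminformatics toolkit. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Hypothetical container for precomputed molecular properties."""
    h_bond_donors: int      # sum of OHs and NHs
    molecular_weight: float
    log_p: float
    h_bond_acceptors: int   # sum of Ns and Os


def rule_of_5_flags(p):
    """List the quoted 'rule of 5' cutoffs that the compound exceeds."""
    flags = []
    if p.h_bond_donors > 5:
        flags.append("more than 5 H-bond donors")
    if p.molecular_weight > 500:
        flags.append("molecular weight over 500")
    if p.log_p > 5:
        flags.append("Log P over 5")
    if p.h_bond_acceptors > 10:
        flags.append("more than 10 H-bond acceptors")
    return flags


# Hypothetical example compounds.
small_compound = Properties(h_bond_donors=2, molecular_weight=350.0,
                            log_p=3.1, h_bond_acceptors=6)
large_compound = Properties(h_bond_donors=6, molecular_weight=620.0,
                            log_p=5.5, h_bond_acceptors=12)

print(rule_of_5_flags(small_compound))
print(rule_of_5_flags(large_compound))
```

As the quoted text itself notes, compound classes that are substrates for biological transporters are exceptions, so flags from such a check are indicative only.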
While this does not mean, of course, that a molecule obeying the rule is likely to become a marketed drug for humans, it does mean that a molecule that fails to obey the rule is statistically most unlikely to do so. We note that this highlighting of the utility of ‘metabolite-likeness’ as a concept in drug discovery in systems pharmacology is just a first step, as the availability of Recon2 for such analyses opens up many new avenues that we do not discuss here. The present analysis has necessarily been retrospective, as we have applied it to existing and successful (i.e. presently marketed) drugs. However, we consider that this rule, and the concept of the utility of metabolite-likeness more generally, may well have significant prospective value in reversing a current trend in medicinal chemistry (Chen et al. 2012; Walters et al. 2011) that runs in a direction precisely opposite to that of metabolite-likeness.
D. B. K. conceived the study, N. S. contributed to and extracted the data from Recon2, S. O’. H. constructed all of the workflows, J. H. performed the multi-objective clustering, and all authors wrote the paper. D. B. K. thanks the University of Manchester and the Biotechnology and Biological Sciences Research Council (BBSRC) for financial support. N. S. thanks the BBSRC for funding under Grant BB/K019783/1.
Conflict of interest
The authors declare that they have no financial or other conflicts of interest.
- Bender, A., & Glen, R. C. (2004). Molecular similarity: A key technique in molecular informatics. Organic & Biomolecular Chemistry, 2, 3204–3218.
- Berthold, M. R., et al. (2007). The Konstanz Information Miner. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Decker (Eds.), Studies in classification, data analysis, and knowledge organization (GfKL 2007) (pp. 319–326). Heidelberg: Springer.
- Chen, H. M., Engkvist, O., Blomberg, N., & Li, J. (2012). A comparative analysis of the molecular topologies for drugs, clinical candidates, natural products, human metabolites and general bioactive compounds. MedChemComm, 3, 312–321.
- Degtyarenko, K., Hastings, J., de Matos, P., & Ennis, M. (2009). ChEBI: An open bioinformatics and cheminformatics resource. Current Protocols in Bioinformatics, Chapter 14, Unit 14–9.
- Dunn, W. B., et al. (2014). Molecular phenotyping of a UK population: Defining the human serum metabolome. Metabolomics, 1, 18.
- Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95, 14863–14868.
- Everitt, B. S. (1993). Cluster analysis. London: Edward Arnold.
- Faulon, J.-L., & Bender, A. (Eds.). (2010). Handbook of chemoinformatics algorithms. London: CRC.
- Flower, D. R. (1998). On the properties of bit string-based measures of chemical similarity. Journal of Chemical Information and Computer Sciences, 38, 379–386.
- Gasteiger, J. (2003). Handbook of Chemoinformatics: From data to knowledge. Weinheim: Wiley/VCH.
- Gozalbes, R., & Pineda-Lucena, A. (2011). Small molecule databases and chemical descriptors useful in chemoinformatics: An overview. Combinatorial Chemistry & High Throughput Screening, 14, 548–558.
- Hamdalla, M. A., Mandoiu, I. I., Hill, D. W., Rajasekaran, S., & Grant, D. F. (2013). BioSM: Metabolomics tool for identifying endogenous mammalian biochemical structures in chemical structure space. Journal of Chemical Information and Modeling, 53, 601–612. doi: 10.1021/ci300512q.
- Handl, J., & Knowles, J. (2007). An evolutionary approach to multiobjective clustering. IEEE Transactions on Evolutionary Computation, 11, 56–76.
- Kell, D. B., & Dobson, P. D. (2009). The cellular uptake of pharmaceutical drugs is mainly carrier-mediated and is thus an issue not so much of biophysics but of systems biology. In M. G. Hicks, & C. Kettner (Eds.), Proceedings of International Beilstein Symposium on Systems Chemistry (pp. 149–168). Berlin: Logos. http://www.beilstein-institut.de/Bozen2008/Proceedings/Kell/Kell.pdf.
- Koutsoukas, A., et al. (2013). How diverse are diversity assessment methods? A comparative analysis and benchmarking of molecular descriptor space. Journal of Chemical Information and Modeling. doi: 10.1021/ci400469u.
- Lipinski, C. A., Lombardo, F., Dominy, B. W., & Feeney, P. J. (1997). Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews, 23, 3–25.
- Medina-Franco, J. L., & Maggiora, G. M. (2014). Molecular similarity analysis. In J. Bajorath (Ed.), Chemoinformatics for drug discovery (pp. 343–399). Hoboken: Wiley.
- Meinl, T., Jagla, B., & Berthold, M. R. (2012). Integrated data analysis with KNIME. Open source software in life science research: Practical solutions in the pharmaceutical industry and beyond, pp. 151–171. doi: 10.1533/9781908818249.
- Oprea, T. I. (2004). Chemoinformatics in drug discovery. Weinheim: Wiley/VCH.
- Todeschini, R., & Consonni, V. (2000). Handbook of molecular descriptors. Weinheim: Wiley-VCH Verlag GmbH.
- Wang, Y., & Bajorath, J. (2010). Advanced fingerprint methods for similarity searching: Balancing molecular complexity effects. Combinatorial Chemistry & High Throughput Screening, 13, 220–228.
Open Access. This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.