A ‘rule of 0.5’ for the metabolite-likeness of approved pharmaceutical drugs

We exploit the recent availability of a community reconstruction of the human metabolic network (‘Recon2’) to study how close in structural terms are marketed drugs to the nearest known metabolite(s) that Recon2 contains. While other encodings using different kinds of chemical fingerprints give greater differences, we find using the 166 Public MDL Molecular Access (MACCS) keys that 90 % of marketed drugs have a Tanimoto similarity of more than 0.5 to the (structurally) ‘nearest’ human metabolite. This suggests a ‘rule of 0.5’ mnemonic for assessing the metabolite-like properties that characterise successful, marketed drugs. Multiobjective clustering leads to a similar conclusion, while artificial (synthetic) structures are seen to be less human-metabolite-like. This ‘rule of 0.5’ may have considerable predictive value in chemical biology and drug discovery, and may represent a powerful filter for decision making processes. Electronic supplementary material The online version of this article (doi:10.1007/s11306-014-0733-z) contains supplementary material, which is available to authorized users.


Introduction
The declining productivity of the drug discovery process is well known (e.g. Empfield and Leeson 2010; Hay et al. 2014; Kell 2013; Kola 2008; Kola and Landis 2004; Rafols et al. 2014; van der Greef and McBurney 2005). Thus, many groups have sought to assess in silico those structural or biophysical properties of successful drugs that might be used as filters to enrich the contents of drug discovery libraries with molecules that share those properties. This has led to concepts such as "drug-likeness" (e.g. Empfield and Leeson 2010; Hay et al. 2014; Kell 2013; Kola 2008; Kola and Landis 2004; van der Greef and McBurney 2005), "lead-likeness" (Gozalbes and Pineda-Lucena 2011; Holdgate 2007; Oprea et al. 2001, 2007; Wunberg et al. 2006), and "ligand efficiency" (Hopkins et al. 2014) by which the potentially desirable properties of such molecules have been assessed.
We recognise that any molecule bioactive in human cells (whether as a drug or for purposes of chemical genomics) must cross at least one membrane, that nutrients necessarily do so, that natural products remain a major source of successful (marketed) pharmaceutical drugs (Gozalbes and Pineda-Lucena 2011; Holdgate 2007; Oprea et al. 2001, 2007; van Deursen et al. 2011; Wunberg et al. 2006), and that successful drugs require or at least use membrane transporters (Dobson et al. 2009; Dobson and Kell 2008; Giacomini and Huang 2013; Giacomini et al. 2010; Kell 2013; Kell and Dobson 2009; Kell et al. 2011, 2013; Kell and Goodacre 2014; Lanthaler et al. 2011) that normally are used for the transport of intermediary metabolites (Herrgård et al. 2008; Swainston et al. 2013; Thiele et al. 2013). Given the natural role for these transporters as transporters of intermediary metabolites, we and others have thus suggested (hypothesised) that successful drugs are in fact much more like metabolites (we use this term to mean the natural intermediary metabolites of human metabolism, and do not consider metabolites of the drugs) than are the typical structures found in drug discovery libraries (e.g. Chen et al. 2012; Dobson et al. 2009; Feher and Schmidt 2003; Gupta and Aires-de-Sousa 2007; Hamdalla et al. 2013; Karakoc et al. 2006; Ranganathan 2009, 2011; Peironcely et al. 2011; Walters 2012; Zhang et al. 2011), and, following the principle of molecular similarity (e.g. Bender and Glen 2004; Eckert and Bajorath 2007; Gasteiger 2003; Maldonado et al. 2006; Oprea 2004; Sheridan et al. 2004), that "metabolite-likeness" is therefore a useful criterion for the design of successful drugs (Dobson et al. 2009).
At one level, this may not be seen as surprising given the fact that pharmaceutical drugs typically bind to proteins at sites to which endogenous metabolites normally bind, but the recognition of the importance of metabolite-likeness in drug discovery and chemical genomics remains less than complete.
While a variety of metabolite (pathway) databases exist (Ooi et al. 2010) [e.g. ChEBI (de Matos et al. 2012; Degtyarenko et al. 2009; Hastings et al. 2013), HMDB (Wishart et al. 2013), KEGG (Kanehisa et al. 2012, 2014), MetaCyc (Altman et al. 2013; Caspi et al. 2014; Karp and Caspi 2011) and MetaboLights (Haug et al. 2013)], the recent availability of a highly curated consensus map (Recon2) of the human metabolic network (and thus of intermediary metabolites) (Swainston et al. 2013; Thiele et al. 2013) now provides the most suitable starting point for the comparison of drugs that have been approved/marketed [available from DrugBank (Knox et al. 2011; Law et al. 2014)] and metabolites that are known to be part of the human metabolic network. We choose this latter over, say, HMDB since the measurable presence of a molecule in a human sample (e.g. Dunn et al. 2014) does not exclude that it has a nutritional, xenobiotic or gut microbial origin, and HMDB does contain many 'metabolites' that are not in fact produced via pathways containing proteins encoded by the human genome. Peironcely et al. (2011) noted, for instance, that the 'metabolite' debrisoquine was classified in their scheme as a nonmetabolite (and it is indeed a marketed drug).
Thus the primary purpose of this work (in contrast to our earlier work (Dobson et al. 2009) that included multiple metabolite databases that were not constrained as here), is to use the availability of Recon2 to assess precisely how 'metabolite-like' known drugs are, partly as an aid to developing metrics for determining whether drugs are likely to be substrates for relevant transporters and thus whether they are likely to be bioactive. The availability of Recon2 also allows us to reason sensibly about the nature and extent of metabolite space and how it differs from the kinds of molecules typically found in drug discovery libraries.

Construction of datasets
The list of FDA-approved small molecule drugs was downloaded from DrugBank 3.0 (http://www.drugbank.ca/downloads) in November 2013 as an SDF file and consists of 1491 molecules. This is significantly smaller than the fuller list (7330 'drugs' via DrugBank and KEGGDrug) used previously (Dobson et al. 2009). The list of intermediary metabolites was extracted from the latest version of the Recon2 human metabolic network (Thiele et al. 2013). A further manual curation removed from the 'drugs' list (i) 'drugs' (mainly nutritional supplements) that are also intermediary metabolites produced by enzymes encoded by the genome and thus part of Recon2 (though adrenaline was treated as a drug), and (ii) those 'metabolites' listed in Recon2 that are xenobiotic in nature or simply metals or salts. However, vitamins and essential amino acids and fatty acids, while not encoded by the human genome, were retained as 'metabolites' as they are both necessary for human metabolism and form part of the formal human metabolic network. The resultant data are in Supplementary information S3, and consist of 1113 'metabolites' [cf. 5333 'metabolites' previously (Dobson et al. 2009)] and 1381 'drugs'. In addition, data on antimalarial compounds were downloaded from the databases at the EBI (https://www.ebi.ac.uk/chemblntd).

Software
For the cheminformatics analyses we used the KoNstanz Information MinEr (KNIME, www.knime.org) (Beisken et al. 2013; Berthold et al. 2007; Mazanetz et al. 2012; Meinl et al. 2012; Stöter et al. 2013; Warr 2012). KNIME is a workflow environment somewhat similar to Taverna [with which we have previous experience in systems biology analyses (Li et al. 2008a, b)], but which is slightly more focussed on cheminformatics. The workflows we used here included nodes that made use of libAnnotationSBML (Swainston and Mendes 2009), the Chemistry Development Kit (Beisken et al. 2013; Steinbeck et al. 2003) and the RDKit (Riniker and Landrum 2013a; Saubern et al. 2011) (www.rdkit.org/). We also used the software MOCK (Handl and Knowles 2007) for multiobjective clustering.

Comparison of Tanimoto distances between drugs and natural metabolites
Our first task was to assess the average chemical (structure) distances between molecules according to a suitable metric. Many molecular descriptors exist for encoding molecules in a manner that allows this (e.g. Bender 2010; Duan et al. 2010; Koutsoukas et al. 2013; Sastry et al. 2010; Sheridan and Kearsley 2002; Todeschini and Consonni 2000; Wang and Bajorath 2010), most commonly referred to as fingerprints (e.g. Faulon and Bender 2010; Flower 1998) and sometimes with rather different properties and outcomes when matched against structures or biological activities (e.g. Dhanda et al. 2013; Medina-Franco and Maggiora 2014). Thus, while some experience shows that they are not greatly different from each other when simply comparing chemical or structural similarity (Dobson et al. 2009; Riniker and Landrum 2013a), which is the focus of the present paper, we looked at a number of methods for producing molecular fingerprints. Probably most common are fingerprints derived from structural keys such as the 166 Public MDL MACCS (Molecular ACCess System) keys (Durant et al. 2002) based on a predefined dictionary of 166 substructures [that contain most of the important features of a larger 960-key set (McGregor and Pallai 1997)] and hashed to give 1,024 bits. Given the molecular fingerprint method chosen, there is a more general acceptance of the metrics for the similarity of molecules whose (sub)structures are so encoded; although it has a size-dependence (that does not matter for this analysis), the Tanimoto distance, which effectively encodes the numbers of matching and non-matching substructures, is both easy to calculate and pre-eminent (Willett 2006).
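As a minimal sketch of the metric just described, assuming that each fingerprint is represented simply as the set of positions of its 'on' bits, the Tanimoto similarity is the number of shared bits divided by the number of bits set in either molecule, and the Tanimoto distance is one minus that value. The function names and the bit positions below are hypothetical and chosen only for illustration; they are not real MACCS key assignments, and this is not the KNIME/RDKit workflow used in the paper.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto similarity c / (a + b - c) for two sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    shared = len(a & b)                 # c: bits set in both fingerprints
    union = len(a) + len(b) - shared    # a + b - c: bits set in either
    # Assumption: two empty fingerprints are given similarity 0.0 by convention.
    return shared / union if union else 0.0


def tanimoto_distance(bits_a, bits_b):
    """Tanimoto distance, defined as 1 - similarity."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


# Hypothetical bit positions, for illustration only.
fp_x = {3, 17, 42, 88, 120}
fp_y = {3, 17, 42, 99}
print(tanimoto_similarity(fp_x, fp_y))  # 3 shared bits / 6 bits in the union = 0.5
```

Because the denominator grows with the number of bits set, larger molecules that set more keys tend to give different similarity ranges from small ones, which is the size-dependence noted above.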
We recognise that some 20 % of recent new chemical entities are prodrugs (15 % in the top 100 drugs) (Huttunen et al. 2011), and that some of these are converted nonenzymically to the active substances; however, these normally do not differ greatly in structural terms from the active substance in the marketed entities, so for convenience we shall use the latter. In contrast to Peironcely et al. (2011), who used supervised learning methods such as random forests [which are very powerful (Knight et al. 2009)] to predict whether a substance was or was not a metabolite, we are here interested only in the structural similarities between candidate molecules and Recon2 metabolites, and we confine ourselves strictly to unsupervised methods of analysis.
We checked a variety of implementations of the MACCS fingerprints (specifically those used in Open Babel, CDK and RDKit) and found very little difference between them, and for what is presented here we used those in the RDKit implementation. We therefore compared all metabolites against all metabolites (Fig. 1a), all drugs against all drugs (Fig. 1b), and all drugs against all metabolites (Fig. 1c). The metabolite-metabolite similarities (Fig. 1a) reveal multiple clusters, including one that is made up of CoA derivatives (full details in Figure S1), while the clusters of drug-drug similarities (Fig. 1b) are rather more heterogeneous (the trees are much 'bushier'). In Fig. 1c, the drug-metabolite similarities, there are some interesting clusters, e.g. the block of red and yellow towards the upper left represents sterols and steroids, while the larger swathe of red and yellow towards the bottom represents mainly CoA derivatives. All the data are given in an addressable form as Excel spreadsheets in Supplementary Information S1-S3.
Fig. 1 Heat maps of the overall similarities between a Recon2 metabolites, b drugs and c each other. In the latter plot, the drugs lie on the X-axis and the metabolites on the Y-axis. Chemical structures were encoded using the MACCS encoding and Tanimoto distances calculated as described in Methods. The heat map representation (Eisen et al. 1998) encodes the numbers as a colour; in the present version, for ease of observation, we use ten discrete colours for the ten decades of Tanimoto similarity, with the colours chosen following the recommendations of Brewer et al. (1997) (see also http://www.colorbrewer2.org/). Also shown are hierarchical clusterings of the rows and columns (Eisen et al. 1998) using complete linkage and the default settings in the hclust function in R (Color figure online)
A number of different fingerprints were used to determine if the extent of closeness of a drug to its nearest metabolite depended greatly on the fingerprint used. The various fingerprints used (http://www.rdkit.org/RDKit_Docs.current.pdf) were provided in the RDKit module (Riniker and Landrum 2013a) (https://code.google.com/p/rdkit/wiki/FingerprintsInTheRDKit) of KNIME (http://tech.knime.org/community/rdkit), and as stated in (Riniker and Landrum 2013b) were atom pairs (AP), feature-based circular fingerprint with radius 2 as bit vector (FeatMorgan2), and a circular fingerprint with radius 2 as bit vector (Morgan2). Morgan2 is the RDKit implementation of the familiar ECFP4, and FeatMorgan2 is equivalent to FCFP4 (Landrum et al. 2011). The features used by the RDKit for FeatMorgan2 consist of various donors, acceptors, aromatic atoms, halogens, basic and acidic atoms. We also used a representation (referred to in KNIME and here as 'RDKit') that is said to be a 'Daylight-like' topological fingerprint based on hashing molecular subgraphs. Most recently, RDKit has added some extra fingerprints, and for completeness we included these too. Thus, 'layered' is an experimental substructure fingerprint using hashed molecular subgraphs, while 'torsion' is said to be the bit vector topological-torsion fingerprint for a molecule. As indicated above, all of the data are tabulated in Fig S3.
Considering first just the Tanimoto similarity (TS) values using MACCS fingerprints and the 1,024 bitstring encoding, 90 % of marketed drugs have a 'nearest metabolite Tanimoto similarity' (NMTS, i.e. the TS to the nearest metabolite) of more than 0.5, 98.5 % over 0.4 and 99 % over 0.34, all highly significant values (Baldi and Nasr 2010). The first of those percentages compares with just 12 % when we did not use the 'genuine' human metabolites of Recon2 (Dobson et al. 2009) (note that there we used the nearest Tanimoto distance (= 1 - TS)). Provided the molecule is not excessively halogenated, its NMTS is over 0.5 (e.g. 0.54 for chlorzoxazone, 0.55 chlormerodrin, 0.6 diclofenac, 0.65 chlorphenesin and so on). This 'rule', by which the very great majority (90 %) of drugs are within a Tanimoto distance of 0.5 in MACCS fingerprint space, may be viewed in the context of the well-known 'rule of 5' (Lipinski et al. 1997) (Ro5) mnemonic for predicting drug lead quality. However, the cumulative plots of the NMTS for each drug using different fingerprints (Fig. 2a) do differ quite significantly depending on which fingerprint is used, and clearly the well-established MACCS fingerprints lead to a substantially greater degree of 'metabolite-likeness' than do almost all the other encodings (we do not pursue this here). Figure 2 also permits one to read off other metrics, for example that more than 50 % of drugs have a TS greater than 0.6 to a metabolite for both MACCS and RDKit encodings.
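The NMTS and the percentages quoted above can be illustrated with a short sketch. It assumes fingerprints held as sets of 'on' bit positions and uses invented toy molecules (`met_A`, `drug_1`, etc. are hypothetical names, not entries from Recon2 or DrugBank); it shows only the arithmetic of taking, for each drug, the maximum Tanimoto similarity over all metabolites and then the fraction of drugs exceeding a threshold.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bit positions."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def nearest_metabolite_similarity(drug_fp, metabolite_fps):
    """Return (NMTS, name of nearest metabolite) for one drug fingerprint."""
    name = max(metabolite_fps, key=lambda m: tanimoto(drug_fp, metabolite_fps[m]))
    return tanimoto(drug_fp, metabolite_fps[name]), name


def fraction_above(values, threshold=0.5):
    """Fraction of values strictly greater than the threshold."""
    values = list(values)
    return sum(v > threshold for v in values) / len(values)


# Toy fingerprints (hypothetical bit positions, illustration only).
metabolites = {"met_A": frozenset({1, 2, 3, 4}), "met_B": frozenset({10, 11, 12})}
drugs = {
    "drug_1": frozenset({1, 2, 3, 5}),       # nearest is met_A: 3 / 5 = 0.6
    "drug_2": frozenset({10, 11, 20, 21}),   # nearest is met_B: 2 / 5 = 0.4
    "drug_3": frozenset({30, 31}),           # shares no bits with any metabolite
}

nmts = {d: nearest_metabolite_similarity(fp, metabolites)[0] for d, fp in drugs.items()}
print(nmts)
print(fraction_above(nmts.values(), 0.5))  # one of the three toy drugs exceeds 0.5
```

Repeating the last step over a grid of thresholds gives a cumulative curve of the kind plotted in Fig. 2a.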
Another indication of the rather different nature of the fingerprints comes from an analysis (Table 1) of the nature, and frequency of occurrence, of the nearest metabolite, where each fingerprint encoding has its own predilections for particular classes of metabolite, reflected also in the overall number of metabolites that are closest to at least one drug. These represent about one quarter of all drugs (or metabolites), an indication of the significant heterogeneity (Hopkins et al. 2014; Paolini et al. 2006) of drug space. RDKit has a slightly unusual predilection for cob(1)alamin and for protoheme, returned as the closest hits on 650 and 73 occasions, respectively (although removing these has negligible effects on the shape of the plot in Fig. 2a, indicating that this lower degree of metabolite-likeness, which is a continuous function, is inherent to the encoding). Scatter plots indicating correlations of 'nearest metabolites' with the different encodings are given in Fig. 2b, again illustrating the substantial differences found using the different encodings. Thus we would stress not only that similarity measures differ significantly for the different encodings, but that in functional terms the well-known existence of activity cliffs (e.g. Maggiora et al. 2014) means that quite small differences in molecular similarity may be highly significant with regard to pharmacological effects. In contrast to studies of related molecules that look at this (e.g. Muchmore et al. 2008; Papadatos et al. 2010), we discuss only the similarities themselves.
In a similar vein, the different encodings produce quite different assessments of the number of metabolites to which each drug displays a Tanimoto similarity exceeding 0.5 (Fig. 2c), with (unsurprisingly, given the data in Fig. 2a) the MACCS, RDKit and Layered encodings showing the greatest tendency towards 'metabolite-likeness'. Based on MACCS, 50 % of marketed drugs have at least 31 metabolites with a TS of 0.5 or more. The 'winner' (i.e. the drug with the most metabolites to which it bears a TS greater than or equal to 0.5) is arbekacin, with 364, and the relevant data, plus a few named drugs, are given in Fig. 2d. It is probably worth commenting, albeit this is not necessarily a surprising finding, that these 'highly metabolite-like' drugs are natural products or molecules derived therefrom [see also (Kell 2013; Newman and Cragg 2012)]. The average greatest TS to a metabolite of the five most drug-like drugs (0.547), the five least drug-like drugs (0.683), the five most drug-like Ro5 failures (0.496) and the five least drug-like Ro5 passes (0.557, but minus tegaserod, not present in our list) as listed by Bickerton et al. (2012) are as noted. By contrast, the substance with the lowest NMTS (perflutren, 0.125) is in fact an injectable contrast agent of lipid microspheres marketed precisely because it does not enter cells, while the next three lowest (NMTS ≤ 0.2) are halothane (an inhalational narcotic), lindane (a topical chlorinated insecticide) and desflurane (a polyfluorinated inhalational anaesthetic), consistent with the fact that virtually no natural human metabolites are halogenated. Ten of the 14 least metabolite-like drugs contain at least two halogens (Fig. 2e).
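The per-drug counts discussed above (the number of metabolites at or above a TS of 0.5, their median over all drugs, and the 'winner') amount to a simple tally. The sketch below assumes set-based fingerprints and uses hypothetical toy data, so the numbers it produces bear no relation to the values reported for real drugs.

```python
from statistics import median


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bit positions."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def count_close_metabolites(drug_fp, metabolite_fps, threshold=0.5):
    """Number of metabolites whose similarity to the drug is >= threshold."""
    return sum(tanimoto(drug_fp, m) >= threshold for m in metabolite_fps)


# Toy fingerprints (hypothetical bit positions, illustration only).
metabolites = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({1, 2, 3, 4}),
    frozenset({9}),
]
drugs = {
    "drug_1": frozenset({1, 2, 3}),
    "drug_2": frozenset({1, 2, 5}),
    "drug_3": frozenset({7, 8}),
}

counts = {name: count_close_metabolites(fp, metabolites) for name, fp in drugs.items()}
winner = max(counts, key=counts.get)   # drug with the most close metabolites
print(counts, median(counts.values()), winner)
```

The median of the counts corresponds to the statement that 50 % of drugs have at least that many metabolites at or above the threshold.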
In a similar vein, it is possible to enquire as to which metabolites have the most or fewest marketed drugs closely associated with them in terms of Tanimoto similarity, the latter in particular as a possible indication of areas of chemical space that might be deemed to be relatively underexplored. The metabolites with the very lowest TS to drugs are small and uninteresting (ammonia, water, etc.), so Fig. 2f illustrates those metabolites that are least similar to numbers of drugs between 900 and 1,000, at the same time illustrating the nonlinearity of drug and metabolite spaces by encoding with colours those metabolites that nonetheless have 1-5 drugs with a TS greater than or equal to 0.9 (glycerol is marked and has one, viz. mannitol). One might consider the sparsely populated areas of 'metabolite-likeness space' to be ones worth pursuing in drug discovery.
Another means of displaying the data, and a convenient means of interrogating them for a drug of interest, is given in Fig. 3, where we display the Tanimoto similarity to all metabolites for the beta-(adrenergic receptor) blocker propranolol. All metabolites with a TS greater than 0.5 are labelled, and structures are shown for (from left to right) propranolol itself, (-)-salsoline, adrenaline, L-normetanephrine, metanephrine and norepinephrine. While 'structural similarity' may be seen as a subjective matter, in this case the chemical similarities are obvious, and it is probably not surprising that a beta-adrenergic antagonist should have similarities of this type.

Multiobjective clustering of drugs and metabolites
In the above, we clustered (or bi-clustered) the drugs and the metabolites separately. Another approach to assessing the mapping of drug and metabolite spaces (and the extent to which they overlap or otherwise) is to apply clustering methods to both together. These algorithms differ widely [there is no single 'correct' clustering (Everitt 1993)] but the state of the art is represented by methods such as MOCK (Handl and Knowles 2007) (MultiObjective Clustering with automatic K) that use multiple objectives [specifically both closeness and connectivity (Handl and Knowles 2007; Handl et al. 2005)] simultaneously to cluster objects on the basis of their 'similarity'. As with any multiobjective method, there are multiple 'best' solutions represented by a Pareto front (Kell 2012), and we illustrate this in Fig. 4. Figure 4a shows the overall variation of 'optimal' cluster number for the Pareto front, with 'knees' at e.g. 3, 7, 25, 30, 42 and 64 clusters, while Fig. 4b shows the distribution of drugs and metabolites in the MOCK solution for 25 clusters. Also marked are the 'top ten' blockbuster drugs by sales from 2010 [NB fluticasone propionate and salmeterol are part of a combined medicine; see also ], while the colour encodes the cluster membership of compounds when there are only seven clusters. Cluster 0 is mainly small metabolites like bicarbonate, but it is evident that the lower clusters all contain both metabolites and drugs. We also looked at the distribution of various molecular properties (such as polar surface area, molecular mass, log P etc.) between clusters, but no trends or hotspots were apparent for particular clusters (not shown).
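The idea of a Pareto front of clustering solutions can be sketched independently of MOCK itself. The example below assumes that each candidate clustering has already been scored on two objectives that are both to be minimised; the labels, the dictionary layout and all scores are hypothetical, and the code is a generic non-dominated filter, not the MOCK algorithm.

```python
def dominates(p, q):
    """True if p is at least as good as q on every objective and better on one."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


def pareto_front(solutions):
    """Return the solutions that no other solution dominates (minimisation)."""
    return [
        s for s in solutions
        if not any(dominates(other["objectives"], s["objectives"]) for other in solutions)
    ]


# Hypothetical scores: (compactness penalty, connectivity penalty), lower is better.
candidates = [
    {"k": 3, "objectives": (9.0, 1.0)},
    {"k": 7, "objectives": (6.0, 2.5)},
    {"k": 10, "objectives": (7.0, 3.0)},   # dominated by the k = 7 solution
    {"k": 25, "objectives": (3.0, 6.0)},
    {"k": 40, "objectives": (3.5, 8.0)},   # dominated by the k = 25 solution
]

front_k = [s["k"] for s in pareto_front(candidates)]
print(front_k)  # [3, 7, 25]
```

Each surviving solution represents a different trade-off between the two objectives, which is why a range of cluster numbers, rather than a single 'correct' one, appears along the front.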

The drug-likeness of synthetic 'druglike' molecules and 'fragments' and of natural products
Having seen the closeness of successful, marketed drugs to metabolites when both are MACCS-encoded, it was important to establish that (while unlikely) this was not a strange artefact of the MACCS encoding itself. To this end, and while we and others (e.g. Dobson et al. 2009; Feher and Schmidt 2003; Khanna and Ranganathan 2011; Medina-Franco and Maggiora 2014; Ohno et al. 2010) have recognised that marketed drugs do differ structurally from most molecules in drug discovery libraries, despite their 'biogenic bias' (Hert et al. 2009), we sought to see how similar such non-marketed drug molecules or compounds are to marketed drugs when we compare them in the same way. The comparison is not entirely favourable to metabolites since we already know (Fig. 2) that many of the very smallest metabolite molecules are simply not druglike, and this is reflected in the data of Fig. 5. Figure 5a shows a heat map relating 2,000 structures taken randomly from the 30,000 in the Maybridge fragment library (similar kinds of map were obtained using subsets of varying sizes up to 15,000) relative to marketed drugs, while Fig. 5b shows that of a random subset of the Maybridge library vs Recon2 metabolites. Figure 5c shows the cumulative similarities (all using MACCS encodings) to metabolites for a collection of molecules from a subset of 1,000 molecules from the Maybridge fragment library, from the 13,533 compounds in the Tres Cantos Antimalarial Drug Set (Gamo [...] (Guiguemde et al. 2010) (note that these last two are in fact 'hits' or actives), for 3 subsets of 1,000 molecules from ZINC (Irwin et al. 2012), and of 1,000 from the ~2,400 natural products molecules in StreptomeDB (Lucas et al. 2013). We also checked to ensure that we are not biased systematically towards an appearance of metabolite-likeness by, say, differences in distributions of molecular weights in the different sets, and Fig. 5d shows that we are not, in that a propensity to metabolite-likeness does not seem to follow systematically the MW distribution of the libraries. It is interesting to note that the Novartis and GSK compounds, selected from a very much larger set on the basis of their bioactivity, were even slightly more 'drug-like' than were those from Recon2 at the left-hand end, though Recon2 was most drug-like overall (note how it and the streptomycete secondary metabolites 'pull away' from the other curves beyond 50 %, Fig. 5c), and it seems that no such 'MACCS artefact' contributes to the 'rule of 0.5'. Interestingly, Maybridge tends to contain a rather greater diversity of structures relative to human metabolites, but it is possible that the libraries might be enriched further for possible drugs if they were to include a greater degree of metabolite-likeness. It will obviously be of future interest to determine which fragments or compounds are enriched in molecules that happen to possess particular bioactivities.

Discussion and conclusions
While both drug and drug target spaces are evidently very heterogeneous (e.g. Adams et al. 2009; Hopkins et al. 2014; Medina-Franco and Maggiora 2014; Paolini et al. 2006), and that is reflected in the analyses presented here, it is highly desirable to be able to find properties that are well represented in marketed (and hence effective and successful) drugs. Given the complexity of drug space, finding a simple mnemonic or rule that has utility is to be welcomed. Indeed, the original 'rule of 5' paper states (Lipinski et al. 1997) "This analysis led to a simple mnemonic which we called the 'rule of 5' because the cutoffs for each of the four parameters were all close to 5 or a multiple of 5 … The 'rule of 5' states that: poor absorption or permeation are more likely when: there are more than 5 H-bond donors (expressed as the sum of OHs and NHs); The MWT is over 500; the Log P is over 5 (or M Log P is over 4.15); there are more than 10 H-bond acceptors (expressed as the sum of Ns and Os); compound classes that are substrates for biological transporters are exceptions to the rule." This famous 'rule of 5' (Lipinski et al. 1997) has been highly influential in this regard, but only about 50 % of orally administered new chemical entities actually obey it (Overington et al. 2006; Zhang and Wilkinson 2007) (and see Hopkins et al. 2014); indeed half of recent 'new chemical entities' are natural products (Newman and Cragg 2012), that do not obey the Ro5 either. The (also very effective) 'rule of three' (Congreve et al. 2003) applies solely to leads and not drugs. While improving drug effectiveness is probably best addressed using combinations of molecules (e.g. Small et al. 2011), we have shown that when encoded using the public MDL MACCS keys, more than 90 % of individual marketed drugs obey a 'rule of 0.5' mnemonic, elaborated here, to the effect that a successful drug is likely to lie within a Tanimoto distance of 0.5 of a known human metabolite.
While this does not mean, of course, that a molecule obeying the rule is likely to become a marketed drug for humans, it does mean that a molecule that fails to obey the rule is statistically most unlikely to do so. We note that this highlighting of the utility of 'metabolite-likeness' as a concept in drug discovery in systems pharmacology is just a first step, as the availability of Recon2 for such analyses opens up many new avenues that we do not discuss here. The present analysis has necessarily been retrospective, as we have applied it to existing and successful (i.e. presently marketed) drugs. However, we consider that this rule, and the concept of the utility of metabolite-likeness more generally, may well have significant prospective value in reversing a current trend in medicinal chemistry (Chen et al. 2012; Walters et al. 2011) that runs in a direction precisely opposite to that of metabolite-likeness.
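To make the two mnemonics discussed above concrete, the following sketch encodes the four Ro5 cutoffs as quoted from Lipinski et al. (1997) and the 'rule of 0.5' as a simple threshold on a precomputed NMTS. The function names, argument names and the example property values are hypothetical; how the descriptors and the NMTS are obtained is assumed to be handled elsewhere.

```python
def ro5_violations(h_donors, mol_weight, log_p, h_acceptors):
    """Count how many of the four quoted 'rule of 5' cutoffs are exceeded."""
    return sum([
        h_donors > 5,       # more than 5 H-bond donors
        mol_weight > 500,   # molecular weight over 500
        log_p > 5,          # Log P over 5
        h_acceptors > 10,   # more than 10 H-bond acceptors
    ])


def passes_rule_of_half(nmts, threshold=0.5):
    """True if the nearest-metabolite Tanimoto similarity exceeds the threshold."""
    return nmts > threshold


# Hypothetical compounds with invented property values (illustration only).
compound_a = {"h_donors": 2, "mol_weight": 259.3, "log_p": 3.0, "h_acceptors": 3}
compound_b = {"h_donors": 6, "mol_weight": 720.0, "log_p": 5.6, "h_acceptors": 12}

print(ro5_violations(**compound_a))   # 0
print(ro5_violations(**compound_b))   # 4
print(passes_rule_of_half(0.62), passes_rule_of_half(0.31))  # True False
```

As the text stresses, passing such a filter does not predict success; the sketch only flags molecules that fall outside the region occupied by most marketed drugs.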