Introduction

Genome-wide association studies (GWAS) in patients and populations test relationships between natural sequence variation (genotype) and disease risk factors, biomarkers, and clinical endpoints using population-based cohort and case–control studies.

A well-established role of Mendelian randomization (MR) analysis is to use genetic variants (mostly identified from GWAS) as instrumental variables to identify which disease biomarkers (e.g., blood lipids such as low- and high-density lipoprotein cholesterol (HDL-C) and triglycerides (TG)) may be causally related to disease endpoints (e.g., coronary heart disease; CHD)1,2. We and others have shown that variants in a gene encoding a specific drug target that alter the target’s expression or function can be used as a tool to anticipate the effect of drug action on the same target. We have referred to this application of MR as “drug target MR”3. In contrast to a genome-wide biomarker MR, where the variants comprising the genetic instrument are selected from across the genome, in a drug target MR analysis, variants are selected from the gene of interest or neighboring genomic region because these variants are most likely to associate with the expression or function of the encoded protein (acting in cis). Whereas genome-wide biomarker MR helps infer the causal relevance of a biomarker for a disease, a drug target MR helps infer whether and, in certain cases, in what direction a drug that acts on the encoded protein, whether an antagonist, agonist, activator, or inhibitor, will alter disease risk (Supplementary Table 1).

Genome-wide biomarker MR studies have validated the causal role of elevated low-density lipoprotein cholesterol (LDL-C) on CHD risk, supporting the findings from randomized controlled trials of different LDL-C lowering drug classes4,5,6,7,8,9. However, such studies have been equivocal on the role of HDL-C and TG in CHD4,5. Clinical trials of these lipid fractions have also been seemingly contradictory. For example, using niacin to raise HDL-C did not reduce CHD risk10, but inhibiting cholesteryl ester transfer protein (CETP) with anacetrapib, which also raises HDL-C, was effective in preventing CHD events11. However, a drug target MR of CETP on CHD, using variants in the CETP gene weighted by their effect on HDL-C, indicates protection from disease (odds ratio (OR): 0.87; 95% CI: 0.84–0.90)3. The finding is consistent with the effect of allocation to the CETP-inhibitor anacetrapib in a placebo-controlled trial (0.93; 95% CI: 0.86–0.99) and is compatible with the view that targeting CETP is an effective therapeutic approach to prevent CHD (Fig. 1)11. Importantly, as discussed in detail elsewhere3, drug target MR analyses which use genetic associations with “biomarkers” downstream to the protein, such as HDL-C, use this effect as a proxy for protein concentration or activity (where this has not been measured directly) and do not provide evidence on whether the biomarker used for the weighting itself mediates disease. Rather, they inform on the validity of the drug target for a disease, regardless of the mediating pathway.

Fig. 1: HDL-C, CETP inhibitor, and CHD: genome-wide biomarker vs. drug target MR.

Forest plot of the HDL-C biomarker MR estimate (Holmes et al., 2015), drug target MR estimate of CETP level and function using HDL-C as a proxy (Schmidt et al., 2020), and odds ratio of anacetrapib clinical trial (HPS3/TIMI55–REVEAL Collaborative Group, 2017). OR odds ratio, CI confidence interval, SD standard deviation.

Taken together, these observations suggest that other similarly effective, as yet unexploited drug targets might exist for the prevention or treatment of CHD, and that these could be identified through their association with blood lipids, even though such analyses should not presume that the effect on CHD is mediated through these lipids.

Here, we apply drug target MR on a set of druggable proteins identified through genetic associations with circulating blood lipids and assess their causal relevance for CHD. To place the findings in context, we first re-evaluate causal effect estimates for LDL-C, HDL-C, and TG on CHD using “genome-wide biomarker MR”, based on summary statistics from GWAS of blood lipids and CHD. Next, we use these data to select genes associated with blood lipids that encode druggable targets and test the effects of these drug targets on CHD using “drug target MR” in two independent datasets. In parallel, we investigate if the genetic associations with each lipid sub-fraction and CHD are consistent with a shared causal variant using genetic co-localization. For a set of replicated, prioritized drug targets, we perform a phenome-wide scan of genetic associations of variants within the encoding gene region with additional disease biomarkers and endpoints beyond CHD. We source data from clinicaltrials.gov and the British National Formulary (BNF) for drugs in clinical phase development and approved medicines, respectively, to identify agents that might be pursued rapidly in clinical phase testing for treatment or prevention of CHD. Because of interest in this area, though not the focus of the work, we also evaluate potential mediators of these effects using multivariable MR (MVMR). Finally, we discuss how this approach might be generalized to other drug targets and clinical endpoints, providing a route to translating findings from GWAS into new drug development.

Results

Genome-wide biomarker MR analysis

Previous genome-wide biomarker MR studies have shown a causal effect of LDL-C and TG on CHD risk, while the causal role of HDL-C remains uncertain5. As an initial step, to confirm the robustness of our analytical pipeline and contextualize further analyses, we first replicated previously reported genome-wide biomarker MR estimates using genetic variants from the Global Lipid Genetic Consortium (GLGC)12 to instrument causal effects of the three lipid sub-fractions on CHD, using summary statistics from the CardiogramPlusC4D Consortium GWAS13. Causal estimates were obtained through univariable MR, with Egger horizontal pleiotropy correction applied through a model selection framework14. The OR for CHD per standard deviation (SD) higher concentration of the corresponding blood lipid fraction was 1.50 (95% confidence interval (CI): 1.39–1.63) for LDL-C, 0.95 (95% CI: 0.90–1.01) for HDL-C and 1.10 (95% CI: 1.01–1.21) for TG. These findings were replicated in an independent analysis using summary statistics from a GWAS meta-analysis of lipids measured using a nuclear magnetic resonance (NMR) spectroscopy platform15,16, and genetic associations with CHD risk derived from UK Biobank17. The OR for CHD per SD increase in LDL-C and TG in the replication dataset were 1.28 (95% CI: 1.25–1.31) and 1.23 (95% CI: 1.14–1.32), respectively, and 0.89 (95% CI: 0.83–0.96) per SD increase in HDL-C. These genome-wide biomarker MR estimates confirmed the previously reported causal effect of LDL-C and TG on CHD risk but illustrated the equivocal role of HDL-C. To account for the correlation between the lipid fractions and evaluate their direct independent effect on CHD, we performed an MVMR analysis in the discovery datasets, which assessed genetic associations with the three lipid sub-fractions and CHD risk in a single model. The MVMR analysis generated an OR of 1.53 (95% CI: 1.44–1.62) per SD higher LDL-C, 0.91 (95% CI: 0.86–0.95) per SD higher HDL-C, and 1.09 (95% CI: 1.01–1.17) per SD higher TG (Supplementary Table 2).
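To make the "OR per SD" quantities above concrete, the following is a minimal sketch of a fixed-effect inverse-variance weighted (IVW) MR estimate computed from summary statistics and converted to an odds ratio. It is a simplified stand-in, not the Egger-corrected model selection framework used in the analysis; the function names and all numbers are illustrative assumptions.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate: weighted regression of outcome effects on
    exposure effects through the origin, with weights 1 / se_outcome^2."""
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    beta = numerator / denominator
    se = math.sqrt(1.0 / denominator)
    return beta, se


def to_odds_ratio(beta, se, z=1.96):
    """Convert a log odds ratio and its standard error to OR and 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Illustrative (made-up) per-allele effects: lipid in SD units, CHD in log odds.
beta_lipid = [0.10, 0.20, 0.30]
beta_chd = [0.04, 0.08, 0.12]
se_chd = [0.01, 0.01, 0.01]

beta, se = ivw_estimate(beta_lipid, beta_chd, se_chd)
odds_ratio, ci_low, ci_high = to_odds_ratio(beta, se)
print(f"OR per SD: {odds_ratio:.2f} (95% CI: {ci_low:.2f}-{ci_high:.2f})")
```

Because the estimate is on the log odds scale, the confidence interval is symmetric around the log OR and asymmetric around the OR itself.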

Drug target MR analysis

Drug target MR was used to determine the effect on CHD of perturbing druggable proteins that influence one or more of the three lipid fractions. First, genes previously shown to encode druggable proteins were selected in regions around variants associated with one or more of the major circulating lipid sub-fractions applying a P value < 1 × 10⁻⁶. This identified 341 genes; 149 for an association with LDL-C, 180 for HDL-C, and 154 for TG18. One hundred forty genes (41%) were associated with a single lipid sub-fraction, 101 (30%) were associated with two sub-fractions and 100 (29%) were associated with all three sub-fractions (Supplementary Fig. 1, Supplementary Data 1). Subsequently, we performed a drug target MR analysis on CHD accounting for genetic correlation between variants (see “Methods”). In the absence of direct measures of the encoded protein, we proxied the effect of genetic drug target perturbation through the downstream effect on one or more of the three lipid sub-fractions. Here, we used genetic associations with LDL-C, HDL-C, and TG as a proxy for drug target effects on CHD, which does not provide direct evidence on whether the drug target itself affects CHD through the leveraged lipid weight; this mediation question is subsequently explored using MVMR.
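The gene selection step can be sketched as a simple interval overlap: a druggable gene is retained if a lipid-associated variant passing the P value threshold falls within the gene or a flanking window. This is a hedged illustration only; the `Gene` and `Variant` records, the gene names, and the coordinates are hypothetical, and the ±50 kbp flank is taken from the window described in the Discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    p_value: float
    trait: str  # lipid sub-fraction the association refers to


def select_genes(genes, variants, p_threshold=1e-6, flank_bp=50_000):
    """Map each gene to the set of lipid traits with an associated variant
    (P < p_threshold) inside the gene body extended by flank_bp on each side."""
    selected = {}
    for variant in variants:
        if variant.p_value >= p_threshold:
            continue
        for gene in genes:
            same_chrom = gene.chrom == variant.chrom
            in_window = gene.start - flank_bp <= variant.pos <= gene.end + flank_bp
            if same_chrom and in_window:
                selected.setdefault(gene.name, set()).add(variant.trait)
    return selected


# Hypothetical genes and variants, for illustration only.
genes = [Gene("GENE_A", "1", 100_000, 120_000), Gene("GENE_B", "1", 500_000, 520_000)]
variants = [
    Variant("1", 160_000, 1e-8, "LDL-C"),  # within 50 kbp downstream of GENE_A
    Variant("1", 90_000, 5e-7, "TG"),      # within 50 kbp upstream of GENE_A
    Variant("1", 530_000, 1e-3, "HDL-C"),  # near GENE_B but not significant
]
selected = select_genes(genes, variants)
print(selected)
```

A gene can therefore be selected for one, two, or all three lipid sub-fractions, which is how the overlap categories reported above arise.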

Of the 341 drug targets, 165 could be associated with CHD, with 131 of these estimates being consistent with a protective effect when instrumented for a reduction in LDL-C or TG and/or elevation in HDL-C (Fig. 2, Supplementary Data 2). When weighted by LDL-C, eighty-seven targets showed a significant effect on CHD after orientating towards an increasing LDL-C direction, with the first and third quartiles (Q) of the CHD OR of 1.93 and 3.32. Similarly, the Q1 and Q3 after orientating the OR toward an increasing HDL-C direction were 0.22 and 0.53 for the 49 significant HDL-C instrumented targets, and for the 49 significant TG instrumented targets these were 1.95 and 4.35, respectively.

Fig. 2: Discovery drug target MR estimates on CHD.

Analyses were performed using genetic associations with LDL-C, HDL-C, and TG from the Global Lipid Genetic Consortium (GLGC) with CHD events from the CardiogramPlusC4D Consortium. Drug targets are grouped by clinical phase according to the ChEMBL database. Blue indicates a beneficial effect on CHD risk and red a detrimental effect per SD difference with respect to the indicated lipid sub-fraction. Significant estimates are indicated with an asterisk (*). Co-localization of genetic effects on the relevant lipid sub-fraction and CHD at the same locus is indicated by a square around the cell.

To assess the potential for false-positive results, the distribution of the exposure-specific P values was tested against the uniform distribution expected under the null hypothesis19. The Kolmogorov–Smirnov (KS) goodness-of-fit test indicated that the observed distribution departed from uniformity, so the findings could not be readily explained by multiple testing alone (Supplementary Fig. 2).
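As an illustration of this check, the sketch below computes the one-sample KS statistic of a set of P values against the Uniform(0, 1) distribution, together with an asymptotic P value. It is a standard-library approximation written for this example (the small-sample correction factor is a commonly used one), not the implementation used in the analysis, and the input P values are made up.

```python
import math


def ks_statistic_uniform(p_values):
    """Largest distance between the empirical CDF and the Uniform(0, 1) CDF."""
    xs = sorted(p_values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def ks_asymptotic_p(d, n, terms=100):
    """Asymptotic Kolmogorov tail probability for statistic d and sample size n."""
    root_n = math.sqrt(n)
    lam = (root_n + 0.12 + 0.11 / root_n) * d  # small-sample correction
    if lam < 0.2:
        return 1.0  # the alternating series is unreliable this close to zero
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Evenly spread P values look uniform; P values piled up near zero do not.
uniform_like = [0.1, 0.3, 0.5, 0.7, 0.9]
enriched = [0.001 * k for k in range(1, 51)]

d_uniform = ks_statistic_uniform(uniform_like)
d_enriched = ks_statistic_uniform(enriched)
p_uniform = ks_asymptotic_p(d_uniform, len(uniform_like))
p_enriched = ks_asymptotic_p(d_enriched, len(enriched))
print(d_uniform, p_uniform, d_enriched, p_enriched)
```

A small KS P value here means an excess of small MR P values relative to what chance alone would produce, which is the pattern the test was used to detect.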

Rediscoveries of indications and on-target adverse effects

We investigated if the drug target MR analysis rediscovered the mechanism of action of drugs with a license for lipid modification or compounds with a different indication but with reported lipid-related effects. To do so, compounds with reported lipid indications or adverse effects were extracted from the BNF website (https://bnf.nice.org.uk/), which comprises prescribing information for all UK licensed drugs. Out of the 341 druggable genes included in the analysis, five encoded the targets of drugs with a lipid-modifying indication (PCSK9, PPARG, PPARA, NPC1L1, and HMGCR) of which NPC1L1, HMGCR, and PCSK9 are targets of drugs used in CHD prevention; and six encoded a protein target of a drug with reported lipid-related adverse effects (ADRB1, TNF, ESR1, FRK, BLK, and DHODH) (Supplementary Data 3). To include outcome and side effect data of candidates in clinical phase development, the 341 drug targets were mapped to compound data available in a clinicaltrials.gov curated database. This database differentiates between endpoints monitored throughout the trial (“outcomes”) and unanticipated harmful episodes during the study that may be on-target or off-target effects of the trial agent (“adverse events”). Of the 341 drug targets, 23 had reported lipid-related outcomes and 40 had reported lipid-related adverse events (Supplementary Data 3).

The pool of druggable targets that were modeled using higher LDL-C as a proxy for the pharmacological action on a drug target included 14 targets of clinically used drugs, three of which were licensed for CHD treatment by lowering LDL-C (HMGCR, PCSK9, and NPC1L1). The non-CHD indications of clinically used drugs included dyslipidemias (PPARA), type 2 diabetes (PPARG and NDUFA13), autoimmune diseases (TNF), neoplasms (RAF1 and PSMA5), circulatory disorders (ABCA1, PLG, ITGB3, and F2), and alcohol-dependency (ALDH2) (Table 1). With the exception of F2, instrumenting the target action through a higher LDL-C effect was associated with a higher CHD risk. Two drug targets were for compounds already in phase 3 trials for CHD prevention (ANGPTL3 and CETP). Lastly, three targets were in phase 2 trials of compounds developed for other indications (CYP26A1, LTA, and LTB). The remaining 82 of the 101 targets had not yet been drugged by compounds in clinical phase development.

Table 1 Univariable drug target MR estimates for drug targets approved for indications other than lipid-lowering.

When using higher HDL-C as a proxy for pharmacological action, MR of four drug targets with compounds approved for non-CHD indications showed a directionally beneficial effect on CHD (VEGFA, PSMA5, CACNB1, and NISCH), suggesting potential for indication expansion (Table 1). Three were targets for drugs approved for non-CHD indications but which showed a potentially detrimental effect direction on CHD when instrumented through increasing HDL-C concentration (ESR1, ALOX5, and TUBB). Both CYP26A1 and CETP were associated with lower CHD risk when the effect on CHD was instrumented through an elevation of HDL-C. The remaining 65 of the 74 targets have not yet been drugged by compounds in clinical phase development.

Lastly, the set of druggable targets with compounds developed for non-CHD indications that were modeled using higher TG as a proxy for the pharmacological action on the target included PPARG, DHODH, VEGFA, TOP1, TUBB, NDUFA13, ABCA1, BLK, and F2 (Table 1). Of these, instrumenting the CHD effect through higher TG via drug action on BLK or F2 increased CHD risk. For the remaining targets, which included CETP, ANGPTL3, and CYP26A1, instrumenting the target effect through lowering TG levels decreased the risk of CHD, while the remaining 52 of the 64 targets have not been drugged by licensed compounds or clinical candidates yet.

Independent validation of the drug target MR estimates

To help verify the MR findings and reduce the multiple testing burden, an independent two-sample drug target MR analysis was conducted using summary statistics from a GWAS of blood lipids measured using an NMR spectroscopy platform15,16, and genetic associations with CHD risk derived from UK Biobank17. The validation analysis identified 47 significant MR estimates (P value < 0.05), of which 39/47 (83%) showed a concordant direction of effect with the initial analysis (Fig. 3) corresponding to 30 drug targets. Replicated targets included the licensed LDL-lowering drug targets PCSK9 and NPC1L1 (Table 2). While the majority of the replicated drug targets were anticipated to decrease CHD risk when instrumenting their effect through LDL-C concentration based on the univariable results, nine of the drug targets analyzed were significantly associated with lower CHD when the drug target effects were modeled through HDL-C and/or TG (Supplementary Fig. 3).

Fig. 3: Replication of drug target MR findings.

The discovery and replication analyses used different data sources for both exposure and outcome. In total, 145 replication MR analyses were performed in which the gene boundaries included genetic associations exceeding the pre-specified significance threshold (P value ≤ 1 × 10⁻⁴).

Table 2 Tissue specificity for replicated genes encoding drug targets.

Discriminating independent lipid effects using MVMR

After considering each lipid sub-fraction as a single measure on disease risk in the univariable drug target MR analyses, we performed a multivariable drug target MR (MVMR) analysis including LDL-C, HDL-C, and TG in a single model to account for potential pleiotropic effects of target perturbation via the other lipid sub-fractions and, in contrast to the previous univariable drug target MR, attempt to directly identify any potential lipid mediating pathway. Twenty-six of the replicated targets had sufficient data (3 or more variants) for the multivariable analysis. This analysis identified a single likely lipid fraction for 12 targets (SLC12A3, APOB, APOA1, PVRL2, APOE, APOC1, CELSR2, GPR61, PCSK9, and CEACAM16 through LDL-C; LPL through HDL-C; and ALDH1A2 through TG) (Supplementary Data 4). We found that SMARCA4 and APOA5 likely affected CHD through LDL-C and TG and that RPL7A likely affected CHD through LDL-C and HDL-C pathways. Due to the limited number of variants in VEGFA, CILP2, NDUFA13, and ANGPTL4, multivariable MR analysis could not distinguish the lipid fraction through which CHD was likely affected. Additionally, the presence of horizontal pleiotropy in the MVMR analysis based on heterogeneity tests suggested that PCSK9, LPL, APOC1, APOE, PVRL2, APOB, APOC3, CETP, APOA1, and CELSR2 may affect CHD through additional pathways beyond the lipid sub-fractions LDL-C, HDL-C, and TG included in the current model.
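The multivariable model can be sketched as a weighted regression of the variant effects on CHD against the variant effects on all three lipid sub-fractions at once, without an intercept. The code below is a minimal IVW-style MVMR under the simplifying assumption of uncorrelated variants; it does not include the LD modeling or the Egger extension, and all data are synthetic.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("exposure effects are collinear; model not identified")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def mvmr_ivw(beta_exposures, beta_outcome, se_outcome):
    """Weighted least squares of outcome effects on several exposure effects.
    beta_exposures holds one row per variant and one column per exposure."""
    k = len(beta_exposures[0])
    if len(beta_outcome) < k:
        raise ValueError("need at least as many variants as exposures")
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    for row, by, se in zip(beta_exposures, beta_outcome, se_outcome):
        w = 1.0 / se ** 2
        for i in range(k):
            xtwy[i] += w * row[i] * by
            for j in range(k):
                xtwx[i][j] += w * row[i] * row[j]
    return solve_linear(xtwx, xtwy)


# Synthetic variant effects on (LDL-C, HDL-C, TG); outcome built from known effects.
true_effects = [0.4, -0.1, 0.1]
beta_lipids = [[0.10, 0.02, 0.01], [0.01, 0.12, -0.03],
               [0.02, -0.04, 0.15], [0.08, 0.05, 0.04]]
beta_chd = [sum(b * t for b, t in zip(row, true_effects)) for row in beta_lipids]
se_chd = [0.01, 0.02, 0.015, 0.01]

estimates = mvmr_ivw(beta_lipids, beta_chd, se_chd)
print(estimates)
```

The requirement of three or more variants per target follows directly from the model: with three exposures, fewer variants leave the regression unidentified, and strongly correlated exposure effects make the estimates imprecise.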

Co-localization between loci for lipids and CHD

Co-localization analyses are often performed to facilitate the mapping of genetic variants to causal genes in a disease GWAS by assessing whether associations with gene expression and disease outcome share a causal variant. Here, we applied co-localization analysis using blood lipids as an intermediate trait instead of gene expression data as a parallel validation step to assess if the genetic associations with each lipid sub-fraction and CHD were consistent with a shared causal variant20. Twenty-eight out of a total of 33 co-localization signals overlapped a significant finding in the discovery MR, which corresponded to 25 genes encoding a drugged or druggable target (Fig. 2). Moreover, 11 of the replicated drug targets showed evidence of co-localization between the lipid sub-fraction and CHD. These included 9 targets replicated for lowering LDL-C levels (SMARCA4, PVRL2, APOE, APOC1, CARM1, RPL7A, ADAMTS13, PCSK9, and C9orf96), one target replicated for raising HDL-C levels (LPL), and one target replicated for lowering TG levels (VEGFA).
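The idea of a shared causal variant can be illustrated with a single-causal-variant co-localization calculation based on approximate Bayes factors. The sketch below is an assumption about the general form of such methods rather than a description of the specific software or settings used here; the prior probabilities, prior effect-size standard deviations, and summary statistics are all illustrative.

```python
import math


def wakefield_abf(beta, se, prior_sd):
    """Approximate Bayes factor for association versus no association."""
    z = beta / se
    shrinkage = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return math.sqrt(1.0 - shrinkage) * math.exp(0.5 * shrinkage * z * z)


def coloc_posteriors(abf1, abf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of five hypotheses for one region:
    H0 no association, H1 trait 1 only, H2 trait 2 only,
    H3 both traits with distinct causal variants, H4 one shared causal variant.
    Computed in linear space for readability; large regions need log space."""
    sum1, sum2 = sum(abf1), sum(abf2)
    sum12 = sum(a * b for a, b in zip(abf1, abf2))
    unnormalized = {
        "H0": 1.0,
        "H1": p1 * sum1,
        "H2": p2 * sum2,
        "H3": p1 * p2 * (sum1 * sum2 - sum12),
        "H4": p12 * sum12,
    }
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Made-up region of three variants; the first is strongly associated with both traits.
lipid_abf = [wakefield_abf(b, 0.02, prior_sd=0.15) for b in (0.12, 0.02, 0.01)]
chd_abf = [wakefield_abf(b, 0.05, prior_sd=0.20) for b in (0.25, 0.025, 0.05)]

posterior = coloc_posteriors(lipid_abf, chd_abf)
print({h: round(p, 4) for h, p in posterior.items()})
```

Because the H3 and H4 terms both depend on the priors, the choice of `p12` relative to `p1` and `p2` directly shifts the balance between "distinct variants" and "shared variant", which is why priors tuned for expression traits may not be optimal for a lipid and disease pair.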

Tissue expression to aid drug target prioritization

While many tissues are involved in lipid homeostasis, the liver is considered the mechanistic effector organ for many therapeutics targeting lipid metabolism21. To investigate if the replicated drug target genes were specifically expressed in the liver or any other particular tissue, we extracted their RNAseq expression profiles from the Human Protein Atlas22 and calculated two commonly used tissue specificity metrics: the tau and z-scores23. Whilst tau summarizes the overall tissue distribution of a given gene and helps to distinguish between broadly expressed housekeeping genes (tau = 0) and tissue-specific genes (tau = 1), z-scores quantify how elevated the expression of a particular gene is in a particular tissue compared to other tissues. Among the 30 replicated genes, 28 had available RNAseq data, of which 15 (54%) showed elevated expression in the liver compared to other tissues (z-score > 1) (Table 2, Supplementary Fig. 4). These genes included the known lipid-lowering drug target genes PCSK9 and NPC1L1. Furthermore, eight genes were highly specific to the liver as indicated by high tau values (tau > 0.8). Other tissues showing elevated expression of the replicated drug target genes were gastrointestinal tissues such as the small intestine and colon (e.g., APOA4 and APOB) and kidney (SLC12A3). Regarding the expression distribution of the targets, 9 showed tau values below 0.5, indicating that they are broadly expressed and suggesting that the possibility of observing adverse effects increases when developing a drug against them24.
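The two specificity metrics can be sketched as follows, using the usual definition of tau (the mean of one minus each tissue's expression relative to the maximum, over n − 1 tissues) and a per-gene z-score across tissues. Whether expression values are log-transformed beforehand, and whether the sample or population standard deviation is used, are assumptions of this example; the expression values are made up.

```python
import statistics


def tau_index(expression):
    """Tau: 0 for uniform expression across tissues, 1 for a single tissue."""
    peak = max(expression)
    if peak <= 0:
        return 0.0  # gene not expressed anywhere; treated as non-specific here
    return sum(1.0 - x / peak for x in expression) / (len(expression) - 1)


def tissue_z_scores(expression):
    """Z-score of each tissue relative to the gene's mean across tissues."""
    mean = statistics.mean(expression)
    sd = statistics.stdev(expression)  # sample SD; an assumption of this sketch
    if sd == 0:
        return [0.0] * len(expression)
    return [(x - mean) / sd for x in expression]


tissues = ["liver", "adipose", "kidney", "colon"]
liver_specific = [10.0, 0.0, 0.0, 0.0]  # hypothetical liver-restricted gene
housekeeping = [5.0, 5.0, 5.0, 5.0]     # hypothetical broadly expressed gene

tau_specific = tau_index(liver_specific)
tau_broad = tau_index(housekeeping)
z_specific = dict(zip(tissues, tissue_z_scores(liver_specific)))
print(tau_specific, tau_broad, z_specific["liver"])
```

Tau describes the gene as a whole, whereas the z-score identifies which tissue drives the specificity, so the two are read together: a liver z-score above 1 with tau above 0.8 indicates a gene that is both elevated in and largely restricted to the liver.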

Phenome-wide scan of replicated drug target candidates

The identification of potential mechanism-based adverse effects of a target represents an important aspect when prioritizing clinical candidates in the drug development pipeline. To explore potential effects of target perturbation on clinical endpoints other than CHD (whether beneficial or adverse), we performed a phenome-wide scan in 102 disease traits of the 30 drug targets replicated via drug target MR (Methods, Fig. 4, Supplementary Figs. 5–33). The 102 traits were agnostically selected and represent the entire spectrum of clinical disease available from the Neale Lab UK Biobank release, with the exclusion of Coronary Artery Disease (ICD10 code: I25), and data from 23 publicly available GWAS with the largest sample sizes for such phenotypes. Besides genome-wide significant associations with diseases of the circulatory system, variants in six drug target genes showed genome-wide significant associations with type 2 diabetes (NDUFA13, CILP2, PVRL2, VEGFA, APOC1, and LPL), five with Alzheimer’s disease (APOC1, PVR, PVRL2, APOE, and CEACAM16), four with asthma (SMARCA4, CETP, VEGFA, and ALDH1A2) and four with gout (APOA1, APOC3, APOA4, and APOA5). Notably, the PheWAS rediscovered the mechanism of action of metformin, a drug targeting NDUFA13 and licensed for type 2 diabetes25.

Fig. 4: Prioritized target: lipoprotein lipase (LPL).

a Genetic associations at the locus (±50 kbp) in black vs. genome-wide associations (gray, P value < 1 × 10⁻⁶ based on two-sided z-tests). The x-axis shows the per allele effect on the corresponding lipid expressed as mean difference (MD) from GLGC and the y-axis indicates the per allele effect on CHD expressed as log odds ratios (OR) from CardiogramPlusC4D. The marker size indicates the significance of the association with the lipid sub-fraction (P value). b Univariable and multivariable (drug target) cis-MR results presented as OR and 95% confidence intervals with lipid exposure (n = 188,577 individuals) and CHD outcome (n = 60,801 cases and 123,504 controls). An asterisk (*) indicates the MR estimates as being replicated, and a dagger (†) that the lipid effect and CHD signals are co-localized. c Disease associations at the locus with 103 clinical endpoints from UK Biobank and GWAS Consortia.

Discussion

By combining publicly available GWAS datasets on blood lipids and CHD and applying MR approaches with drug information and clinical data, we have genetically validated and prioritized drug targets for CHD prevention.

We identified 131 drug target genes associated with CHD risk from a set of 341 druggable genes overlapping associations with one or more of the major blood lipid fractions. The set of targets included NPC1L1, HMGCR, and PCSK9, which are known targets of LDL-lowering drugs whose efficacy in CHD prevention has been proven in clinical trials. We performed an independent replication study to corroborate both the targets and the direction of the effects. We replicated the findings in independent datasets (UCLEB Consortium and UK Biobank) in which lipids were measured using a different platform (NMR spectroscopy in UCLEB) and the disease endpoints ascertained by linkage to routinely recorded health data (UK Biobank). The validation study replicated 83% (39/47) of the initial estimates, including the mechanism of action of current lipid-modifying drug targets PCSK9 and NPC1L1 and the suggested mechanism of action of compounds under investigation for lipid modification through TG or HDL-C, such as CETP inhibitors26,27.

As a positive control step, our (genome-wide) biomarker MR analysis replicated previous findings on the potential causal relevance of LDL-C, TG, and HDL-C5,11,28. Importantly, in contrast to previous studies, here we replicated findings using a completely independent set of NMR-spectroscopy measured lipids data and CHD cases sourced from UK Biobank. While the causal relevance of LDL-C for CHD has been robustly proven through successful drug development of, for example, statins, there are as yet no compounds licensed for CHD prevention through effects on HDL-C and TG. Hence, the causal relevance of these lipid sub-fractions, while supported by the current genome-wide biomarker MR analyses, cannot be concluded definitively. It is therefore essential to highlight that, while our drug target analysis uses genetic associations with these lipid sub-fractions as weights, our inference throughout has been on the therapeutic relevance of perturbing the proteins encoded by the corresponding genes, which are the main category of molecular target for drug action. The genetic associations with the corresponding lipids are merely used as a proxy for protein activity and/or concentration, serving to orientate the MR effects in the direction of a therapeutic effect. They do not provide comprehensive evidence on the pathway through which perturbation of such targets causally affects CHD. Nevertheless, MVMR does provide insight on the potential relevance of lipid pathways in mediating the effects of drug target perturbation. In general, results that do not meet the significance threshold should not be over-interpreted as proof of the absence of effect29. This may be exacerbated here by potential weak instrument bias, which will be expected to attenuate results towards the no-effect direction.

The set of 30 replicated drug targets also included lipoprotein lipase (LPL), a target that could potentially decrease CHD risk based on the univariable MR findings, with an effect through HDL-C further endorsed by the co-localization and MVMR analyses (Fig. 4). In contrast to current lipid-lowering drug targets which are specifically expressed in the liver, LPL shows the highest specific expression in adipose tissue which suggests tissues beyond the liver may be relevant to target lipid metabolism. Several pharmacological attempts have been pursued to target LPL30,31, and gene therapy has also been applied to treat LPL deficiency by introducing extra copies of the functional enzyme in patients with hypertriglyceridemia32. The approval of gene therapy interventions and the known indirect activation of LPL by drugs targeting other proteins, such as fibrates33 and metformin34, suggest that the previous failure of compounds targeting LPL in initial trials may have been idiosyncratic. LPL activity is also modulated by another protein in the replicated dataset, apolipoprotein A5 (ApoA5), which is exclusively expressed in liver tissue. The MVMR suggests that ApoA5 (partially) affects CHD through LDL-C and TG-mediated pathways. Regardless of the mediating lipid or lipids, the genetic findings in relation to both LPL and ApoA5 are consistent and point to this as an important potentially targetable pathway in atherosclerosis, supporting prior work35.

To provide an indicative genetic profile of a drug target and hypothesize about potential mechanism-based adverse effects, repurposing opportunities or expansion of the indication portfolio of a drug target, we performed a PheWAS of variants in and around the replicated set of targets on 102 traits. While in some cases PheWAS highlighted associations with particular clinical endpoints, for example, the rediscovery of already known indications or biological pathways, such as the associations of type 2 diabetes with variants in NDUFA13 or the association of Alzheimer’s Disease with APOE, further research is needed to evaluate the causal role of the target in the corresponding disease and the beneficial or detrimental effects of modulating those targets pharmacologically.

Some limitations of this study are noteworthy. First, we only included genes regarded as encoding druggable proteins, which currently comprise approximately 25% of all protein-coding genes18. As knowledge advances, additional proteins will become druggable, and alternative therapeutic strategies such as antisense oligonucleotides and gene therapy may extend the range of mechanisms that can be targeted. The approach we describe is in fact agnostic to therapeutic modality and could be adapted accordingly. Notably, antisense oligonucleotides were efficiently delivered to the liver36, where 54% of the prioritized targets in our analysis showed elevated expression compared to other tissues. Second, we assigned variants to druggable genes based on genomic proximity, which may be as reliable as other approaches in mapping causal genes37,38,39. However, simple genomic proximity might result in the misleading assignment of the causal gene in a region containing multiple genes in high LD (e.g., PVRL2, APOC1, and APOE are all located in a region of LD in Chr19:45349432-45422606, GRCh37). In an effort to account for this, all the druggable genes (±50 kbp) that overlap one of the genetic variants associated with LDL-C, HDL-C, or TG were included in the analysis, and we provided information on the proximity of the variant to the gene, a gene distance rank value (in base pairs), and previous gene prioritization data by the Global Lipids Genetics Consortium (GLGC)12 to inform scenarios in which the causal gene may be a non-druggable gene but reside in the same region (Supplementary Data 1). Lastly, because some but not all of the studies contributing to these consortia measured blood lipids on a fasting sample, we are unable to conduct separate analyses based on genetic effects in the fasting and non-fasting state.

We used cis-MR to evaluate the relevance of each drug target to CHD, which is less prone to violation of the horizontal pleiotropy assumption than MR analyses with trans instruments3, which also require direct measurement of the protein of interest. However, cis-MR also requires some decisions to be made regarding instrument selection: defining the locus of interest, the significance threshold for the association with the exposure, and the LD threshold to prune correlated instruments. Since an agreement on the choice of a general LD threshold and flanking region has yet to be reached, we used a window of 50 kbp and an LD threshold of 0.4, which showed the most consistent estimates in a grid-search in the discovery data using the four positive control examples: PCSK9, NPC1L1, HMGCR, and CETP. Based on previous studies showing that using less stringent P value thresholds often results in improved performance in cis-MR settings, we relaxed the threshold below genome-wide significance to select the genetic associations to instrument the exposure, and accounted for LD correlation by pruning and LD modeling during the MR analysis3,40.
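The pruning decision can be sketched as a greedy procedure: variants in the locus are visited in order of increasing exposure P value, and a variant is kept only if its LD with every variant already kept is below the threshold. The sketch assumes the 0.4 threshold refers to pairwise r² (the text does not say whether r or r² is meant), and the variant identifiers and LD values are hypothetical.

```python
def prune_by_ld(variants, r2, threshold=0.4):
    """Greedy LD pruning.

    variants: list of (variant_id, exposure_p_value) within one locus.
    r2: dict mapping frozenset({id_a, id_b}) to pairwise r^2; missing pairs
        are treated as uncorrelated.
    Returns the ids of retained variants, most significant first.
    """
    kept = []
    for variant_id, _ in sorted(variants, key=lambda v: v[1]):
        in_ld = any(r2.get(frozenset((variant_id, k)), 0.0) >= threshold for k in kept)
        if not in_ld:
            kept.append(variant_id)
    return kept


# Hypothetical locus with three candidate instruments.
candidates = [("rs_a", 1e-12), ("rs_b", 1e-9), ("rs_c", 1e-7)]
pairwise_r2 = {
    frozenset(("rs_a", "rs_b")): 0.8,  # rs_b is largely redundant with rs_a
    frozenset(("rs_a", "rs_c")): 0.1,
    frozenset(("rs_b", "rs_c")): 0.3,
}

instruments = prune_by_ld(candidates, pairwise_r2)
print(instruments)  # ['rs_a', 'rs_c']
```

With a threshold as permissive as 0.4 the retained variants remain correlated, which is why pruning is paired with explicit LD modeling in the MR step rather than used alone.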

We addressed multiple testing in the MR analyses in a number of complementary ways. To assess the potential for false-positive results, we tested the distribution of the exposure-specific P values against the uniform distribution expected under the null hypothesis19. The KS goodness-of-fit test indicated that the number of extreme P values obtained would be highly unlikely under the null hypothesis, suggesting that they are unlikely to represent false positives. Next, we validated our findings with independent data sources and conducted a second drug target MR, although several drug target genes could not be evaluated in the validation analysis because the gene boundaries did not include genetic associations exceeding the pre-specified significance threshold (P value ≤ 1 × 10⁻⁴), likely related to the “modest” sample size of the NMR replication data (N = 33,029). By drawing inference on replicated data, the multiple testing burden was considerably reduced (0.05² = 0.0025), which when applied to 98 drug targets retained after replication would suggest up to one result being a false positive.
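The arithmetic behind the replication argument can be written out explicitly. The sketch assumes that, under the null hypothesis, the discovery and replication tests are independent and each is declared significant at 0.05; the variable names are illustrative.

```python
alpha = 0.05      # per-analysis significance threshold
n_targets = 98    # number of drug targets the joint threshold is applied to

# Probability that a null result is significant in both independent analyses.
joint_alpha = alpha ** 2

# Expected number of false positives across all targets at the joint threshold.
expected_false_positives = n_targets * joint_alpha

print(f"joint alpha = {joint_alpha:.4f}")
print(f"expected false positives = {expected_false_positives:.3f}")
```

The expected count is below one, which is the basis for the statement that at most about one replicated result would be expected to be a false positive.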

Beyond univariable MR analyses, we attempted to further validate the findings with a multivariable extension of the inverse-variance weighted (IVW) and MR Egger methods; however, in some cases we observed imprecise estimates, in line with previous studies which attributed this to the inclusion of highly correlated exposures in the model41. To further evaluate if the association signals in the exposure and outcome datasets shared a causal genetic variant, we performed co-localization analyses. Because these analyses were originally developed to find evidence of co-localization between mRNA expression and disease and not for an intermediate trait and disease, the default prior probabilities used in the analysis may not be optimal for these pairs of traits. In addition, the single-causal-variant assumption in genetic co-localization methods may not always be satisfied even when prior conditional analyses are performed, with regions with multiple causal variants potentially yielding false-negative results42.

The effect directions of the replicated drug targets were compared to results from clinical trials using data from the clinicaltrials.gov registry. However, the lack of precision in the annotation of events associated with lipid perturbations (e.g., hyperlipidemia) in this dataset hinders the assignment of reported lipid abnormalities to a particular lipid sub-fraction. Moreover, the proportion of clinical trials with reported results in clinicaltrials.gov is less than 54.2%43, suggesting that additional drug candidates with lipid effects might have been investigated but were not included in this analysis because of the lack of accessible data. Furthermore, our analysis relied on mapping clinical trial interventions to compounds known to act through binding to the targets of interest, which could potentially miss clinical trials of compounds annotated with fewer synonyms (such as research codes for compounds used by individual trial sponsors). Lastly, we performed a PheWAS spanning over 100 clinical endpoints, 80 of which were derived from UK Biobank. While this enabled screening for associations with a wide range of diseases, genetic associations derived from diagnostic codes in electronic health record datasets might suffer from limited case numbers and inaccurate case and control definitions, which would reduce the power to detect true associations. To increase the power to detect associations, we included data from publicly available GWAS with the largest sample sizes for such phenotypes.

In summary, we have shown an approach to move from GWAS signals to drug targets and disease indications. We illustrated its potential using genetic association data on lipids and CHD, but the approach could also be applied in other settings where there are GWAS of diseases and biomarkers thought to be potentially affected by the drug target. For example, with the increasingly available data on inflammatory biomarkers, this approach could be used to evaluate the causal role of anti-inflammatory drug targets, such as IL6R, in CHD, Alzheimer’s disease, and major depression, following up on associations described in several studies44,45,46, to identify potential new indications for anti-inflammatory agents established in the treatment of autoimmune conditions. Similarly, recent genetic studies on coagulation factor levels47 can be harnessed to instrument the effect of modulating druggable targets for thrombotic disorders, such as FXI or FXII, which are emerging as potential targets for anticoagulant drugs48,49.

When used as a screening tool, the approach could help reduce the high failure rate in drug discovery by genetically validating targets in the earlier phases of the drug development pipeline.

Methods

Data sources

To determine the causal role of LDL-C, HDL-C, and TG in CHD, and to replicate previously reported results, we obtained summary-level genetic estimates from the Global Lipids Genetics Consortium (188,577 individuals)12 and from CARDIoGRAMplusC4D (60,801 cases and 123,504 controls)13.

Independent replication data were sourced using lipids exposure data from a GWAS meta-analysis of metabolic measures by the University College London–Edinburgh-Bristol (UCLEB) Consortium15 and Kettunen et al.16, in which lipids were measured by NMR spectroscopy (joint sample size up to 33,029). Independent CHD data were obtained from a publicly available GWAS of 34,541 cases and 261,984 controls in UK Biobank17.

Individual-level data from a random subset of 5000 unrelated individuals of European ancestry from UK Biobank were used to generate the LD reference matrices as described in the Instrument selection section.

Drug target gene selection

Analyses were conducted using Python v3.7.3. To estimate the causal effect of modulating the level of each lipid sub-fraction via a druggable gene on CHD, variants associated with LDL-C, HDL-C, and/or TG with a P value ≤ 1 × 10⁻⁶ were selected. Druggable genes overlapping a 50 kbp region around the selected variants were extracted, resulting in 341 associated drug target genes (149 for LDL-C, 180 for HDL-C, and 154 for TG). The set of genes in the druggable genome was identified previously18, and identifiers were updated to Ensembl version 95 (GRCh37), which was used in this analysis. Because we only scanned for genetic associations with the druggable genome, protein-coding genes that were the “true” causal gene but not yet druggable would be missed and the association misassigned. To mitigate this and provide information about potential effects through non-druggable genes, we provide the minimum distance from the variant to the gene (variants located within a gene were given a distance of 0 bp), a gene distance rank value according to base-pair distance, and an indication of the druggable genes prioritized by GLGC (Supplementary Data 1).
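The variant-to-gene assignment described above can be illustrated with a minimal sketch. It assumes a flank of 50 kbp on either side of each gene and ranks only the genes supplied to it; the `Gene` type, the function names, and the toy coordinates are hypothetical and are not taken from the analysis code.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # inclusive base-pair coordinate
    end: int    # inclusive base-pair coordinate


def distance_to_gene(pos: int, gene: Gene) -> int:
    """Minimum base-pair distance; 0 if the variant lies within the gene."""
    if gene.start <= pos <= gene.end:
        return 0
    return min(abs(pos - gene.start), abs(pos - gene.end))


def rank_nearby_genes(chrom: str, pos: int, genes: List[Gene],
                      flank: int = 50_000) -> List[Tuple[int, str, int]]:
    """Return (rank, gene name, distance) for genes within `flank` bp,
    ranked from nearest to farthest. The flank size is an assumption."""
    hits = sorted(
        (distance_to_gene(pos, g), g.name)
        for g in genes
        if g.chrom == chrom and distance_to_gene(pos, g) <= flank
    )
    return [(rank, name, dist) for rank, (dist, name) in enumerate(hits, start=1)]


# Toy example with made-up coordinates
toy_genes = [
    Gene("GENE_A", "1", 100_000, 200_000),
    Gene("GENE_B", "1", 240_000, 300_000),
    Gene("GENE_C", "1", 400_000, 500_000),
]
print(rank_nearby_genes("1", 210_000, toy_genes))
# [(1, 'GENE_A', 10000), (2, 'GENE_B', 30000)]
```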

Instrument selection

For the genome-wide biomarker MR analyses, a P value threshold of 1 × 10⁻⁶ was used to select exposure variants associated with LDL-C, HDL-C, and/or TG. For cis- or drug target MR analyses, variants within the 341 selected genes (±50 kbp) were selected based on a P value ≤ 1 × 10⁻⁴. In both settings, variants were filtered on a MAF > 0.01 and LD clumped to an r² < 0.4. These parameters showed the most consistent estimates in a grid-search in the discovery data using the positive control examples: PCSK9, NPC1L1, HMGCR, and CETP (Supplementary Fig. 34). To account for residual correlation between variants in the MR analyses, we applied a generalized least squares framework with an LD reference dataset derived from UK Biobank50. LD reference matrices were created by extracting a random subset of 5000 unrelated individuals of European ancestry from UK Biobank. Variants with a MAF < 0.001 and imputation quality < 0.3 were excluded. To ensure that genotypes of SNPs with lower MAF were called with higher confidence, variants were removed if MAF < 0.005 and genotype probability < 0.9; MAF < 0.01 and genotype probability < 0.8; or MAF < 0.03 and genotype probability < 0.6.
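The tiered quality-control rules for the LD reference panel can be written compactly, as in the sketch below. It makes one labeled assumption: a variant is dropped if either the MAF floor or the imputation-quality floor is violated, which is one possible reading of the exclusion rule. The function and constant names are hypothetical.

```python
# (MAF upper bound, minimum genotype probability) pairs from the text
GENOTYPE_PROB_TIERS = ((0.005, 0.9), (0.01, 0.8), (0.03, 0.6))


def keep_for_ld_reference(maf: float, info: float, genotype_prob: float) -> bool:
    """Return True if a variant passes the LD reference QC rules.

    Assumption: failing EITHER the MAF floor (0.001) or the imputation
    quality floor (0.3) is enough to exclude the variant.
    """
    if maf < 0.001 or info < 0.3:
        return False
    # Rarer variants need more confident genotype calls
    for maf_upper, min_prob in GENOTYPE_PROB_TIERS:
        if maf < maf_upper and genotype_prob < min_prob:
            return False
    return True


print(keep_for_ld_reference(maf=0.004, info=0.9, genotype_prob=0.85))  # False
print(keep_for_ld_reference(maf=0.02, info=0.9, genotype_prob=0.70))   # True
```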

MR analysis

As a validation step, a genome-wide biomarker MR analysis was conducted for each lipid sub-fraction to replicate previous findings using genetic associations across the genome. A model-selection framework was used to decide between competing IVW fixed-effects, IVW random-effects, MR–Egger fixed-effects, or MR–Egger random-effects models14. While IVW models assume an absence of directional horizontal pleiotropy, Egger models allow for possible directional pleiotropy at the cost of power. After removing variants with large heterogeneity (P value < 0.001 for Cochran’s Q test) or leverage, we re-applied this model selection framework and used the final model. The influence of parameter selection on drug target MR performance was explored in a grid-search of several r² and gene boundary combinations using the positive control examples PCSK9, NPC1L1, HMGCR, and CETP, where the lipid perturbation is the intended indication. To assess the possibility of false-positive results, we compared the empirical P value distribution of the discovery MR findings against the continuous uniform distribution using the Kolmogorov–Smirnov (KS) goodness-of-fit test (two-sided). Under the null hypothesis of no association, P values follow a continuous uniform distribution between 0 and 1 (ref. 19).
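As an illustration of the false-positive check, the sketch below computes a two-sided one-sample KS statistic against the uniform distribution on [0, 1] using only the standard library. The P value relies on the asymptotic Kolmogorov series with a small-sample correction, so it is approximate, and it is not necessarily the routine used in the analysis; the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def ks_test_uniform(p_values: Sequence[float]) -> Tuple[float, float]:
    """Two-sided KS test of `p_values` against Uniform(0, 1).

    Returns (D, approximate P value). The P value uses the asymptotic
    Kolmogorov distribution and is an approximation.
    """
    x = sorted(p_values)
    n = len(x)
    # Largest gap between the empirical CDF and the uniform CDF F(v) = v
    d_plus = max((i + 1) / n - v for i, v in enumerate(x))
    d_minus = max(v - i / n for i, v in enumerate(x))
    d = max(d_plus, d_minus)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        # The series converges slowly here and the P value is ~1
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))


# Evenly spread P values look uniform; a pile-up near zero does not
uniform_like = [(i + 0.5) / 100 for i in range(100)]
enriched = [1e-6 * (i + 1) for i in range(100)]
print(ks_test_uniform(uniform_like))
print(ks_test_uniform(enriched))
```

A small P value from this test indicates an excess of extreme MR P values relative to what the null hypothesis would produce.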

In addition, we conducted genome-wide biomarker and drug target multivariable MR analyses using genetic associations with the three lipid sub-fractions and CHD risk in a single regression model, to identify likely mediating lipids in the causal pathway of CHD.

Results were presented as mean difference (MD) or OR with 95% confidence interval (95% CI) coded towards the canonical drug target effect direction; i.e., toward lower LDL-C and triglyceride concentration, and higher HDL-C concentration.

Co-localization analysis

To estimate the posterior probability of each druggable gene sharing the same causal variant for the exposure lipid and CHD risk51, we performed colocalization analyses. First, we conducted a stepwise conditional analysis using GCTA-COJO v1.92.4 with genotype data from 5000 individuals randomly selected from UK Biobank52. Colocalization analyses were performed using a Python implementation of the Bayesian method “coloc” v3.2-1 (ref. 20). The default prior probabilities were used to estimate if an SNP was associated only with the lipid sub-fraction (p1 = 10⁻⁴), only with CHD risk (p2 = 10⁻⁴), or with both traits (p12 = 10⁻⁵). For each drug target gene, all variants within the gene boundaries (±50 kbp) with a MAF > 0.01 were included. A posterior probability above 0.8 was considered sufficient evidence of colocalization based on previous observations20.
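To make the role of the three prior probabilities concrete, the sketch below combines per-variant approximate Bayes factors into posterior probabilities for the five colocalization hypotheses (H0: no association; H1 and H2: one trait only; H3: distinct causal variants; H4: a shared causal variant). It is a simplified illustration under the single-causal-variant assumption and is not the implementation used here; the function names, the prior standard deviation, and the toy z-scores are assumptions.

```python
import math
from typing import List, Sequence


def log_abf(beta: float, se: float, prior_sd: float) -> float:
    """Wakefield's log approximate Bayes factor for one variant."""
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    z = beta / se
    return 0.5 * (math.log(1.0 - r) + r * z ** 2)


def logsumexp(xs: Sequence[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def coloc_posteriors(labf1: Sequence[float], labf2: Sequence[float],
                     p1: float = 1e-4, p2: float = 1e-4,
                     p12: float = 1e-5) -> List[float]:
    """Posterior probabilities [H0, H1, H2, H3, H4]; needs >= 2 variants."""
    n = len(labf1)
    shared = logsumexp([a + b for a, b in zip(labf1, labf2)])
    # Distinct causal variants: all pairs (i, j) with i != j.
    # Written as an explicit O(n^2) sum for clarity and numerical safety.
    distinct = logsumexp([labf1[i] + labf2[j]
                          for i in range(n) for j in range(n) if i != j])
    log_h = [
        0.0,
        math.log(p1) + logsumexp(labf1),
        math.log(p2) + logsumexp(labf2),
        math.log(p1) + math.log(p2) + distinct,
        math.log(p12) + shared,
    ]
    total = logsumexp(log_h)
    return [math.exp(h - total) for h in log_h]


# Toy region: both traits peak at the second variant (made-up z-scores)
se, prior_sd = 0.02, 0.15  # illustrative values only
lipid = [log_abf(z * se, se, prior_sd) for z in (0.5, 8.0, 0.3)]
chd = [log_abf(z * se, se, prior_sd) for z in (0.2, 7.0, -0.4)]
pp = coloc_posteriors(lipid, chd)
print(pp[4] > 0.8)  # True: evidence for a shared causal variant
```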

Drug indications and adverse effects

To evaluate if the drug target MR and colocalization analyses rediscovered known drug indications or adverse effects, or predicted repurposing opportunities, drug information and clinical trial data were extracted for the set of 341 druggable targets. Drug target genes were mapped to UniProt identifiers, and indications and clinical phase for compounds that bind the target were extracted from the ChEMBL database (version 25)53. Drug indications and lipid adverse effects data for licensed drugs were extracted from the British National Formulary (BNF) website (https://bnf.nice.org.uk/) in July 2019.

To further examine the effects of the drugs and clinical candidates that are known to act through binding to the 341 druggable targets, relevant clinical trial data were downloaded from the clinicaltrials.gov registry. Compound names and synonyms were extracted from the ChEMBL database (version 25)53 and used to identify clinical trials with matching interventions. In the case of non-exact matches, the results were inspected manually to ensure that only relevant trial records were used in the analysis. Lipid-related trial outcomes and adverse events were identified by searching the relevant fields within the trial records with the keywords: lipo*, lipid*, ldl*, hdl*, cholest*, and triglyceride*. For adverse events, the search was limited to the trial arm in which the drug of interest was administered (as opposed to placebo or active control used in the study), and only adverse events that affected at least one study participant were included.
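The keyword search can be sketched with a regular expression. The sketch assumes that each trailing wildcard denotes a case-insensitive match at the start of a word; if the search was instead applied as a substring match, the leading word boundary would be dropped. The function names are hypothetical.

```python
import re
from typing import Iterable, List, Pattern

KEYWORDS = ("lipo*", "lipid*", "ldl*", "hdl*", "cholest*", "triglyceride*")


def compile_keywords(patterns: Iterable[str]) -> Pattern[str]:
    """Turn trailing-wildcard keywords into one case-insensitive regex.

    Assumption: 'stem*' means a word that starts with 'stem'.
    """
    stems = [re.escape(p.rstrip("*")) for p in patterns]
    return re.compile(r"\b(?:" + "|".join(stems) + r")\w*", re.IGNORECASE)


LIPID_RE = compile_keywords(KEYWORDS)


def lipid_terms(text: str) -> List[str]:
    """Return all lipid-related terms found in a free-text trial field."""
    return [m.group(0) for m in LIPID_RE.finditer(text)]


print(lipid_terms("Blood triglycerides increased; HDL cholesterol decreased"))
# ['triglycerides', 'HDL', 'cholesterol']
```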

Tissue expression analysis

To further characterize the genes prioritized by the MR pipeline, their tissue expression was analyzed as follows. First, RNAseq data were downloaded from Human Protein Atlas (HPA)22, which captures the baseline expression of human genes and proteins across a panel of diverse healthy tissues and organs. For each included gene and tissue, HPA provides a consensus Normalized eXpression value (NX), obtained by normalizing TPM (transcripts per million) values from three independent transcriptomics datasets: GTEx54, FANTOM5 (ref. 55), and HPA’s own RNAseq experiments56.

The downloaded NX values were then used to investigate if the prioritized target genes were specifically expressed in any of the included tissues. Two commonly used tissue specificity metrics were calculated for each gene: tau and z-score23. Tau summarizes the overall tissue distribution of a given gene and ranges from 0 to 1, where 0 indicates ubiquitous expression across all included tissues (house-keeping genes) and 1 indicates narrow expression (highly tissue-specific genes). While tau provides a single summary measure of the tissue specificity, z-scores are calculated for individual tissues separately to quantify how elevated the gene expression is in a particular tissue compared to others. Here, higher z-score values indicate higher tissue specificity. See Kryuchkova-Mostacci et al.23 for details on the calculation and interpretation of the two metrics.
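The two tissue specificity metrics can be illustrated on a toy expression profile. The sketch uses the commonly cited form of tau, the sum over tissues of (1 − x_i / x_max) divided by (n − 1), and a per-tissue z-score based on the sample standard deviation. Any transformation of NX values applied before the calculation is not shown, and the function names and toy values are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict


def tau(expression: Dict[str, float]) -> float:
    """Tau specificity index: 0 = ubiquitous, 1 = single-tissue expression."""
    values = list(expression.values())
    x_max = max(values)
    if x_max == 0:
        return 0.0  # not expressed anywhere; treated as non-specific here
    return sum(1.0 - v / x_max for v in values) / (len(values) - 1)


def tissue_z_scores(expression: Dict[str, float]) -> Dict[str, float]:
    """Per-tissue z-scores relative to the gene's mean across tissues."""
    values = list(expression.values())
    m, s = mean(values), stdev(values)
    if s == 0:
        return {tissue: 0.0 for tissue in expression}
    return {tissue: (v - m) / s for tissue, v in expression.items()}


# Toy gene expressed almost exclusively in liver (made-up NX values)
nx = {"liver": 10.0, "heart": 0.0, "brain": 0.0}
print(tau(nx))  # 1.0
print(max(tissue_z_scores(nx), key=tissue_z_scores(nx).get))  # liver
```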

Phenome-wide scan of replicated drug target genes

To provide an overview of potential non-CHD effects of the prioritized drug targets, we performed a phenome-wide scan of 102 disease endpoints. These included genome-wide summary statistics for 80 ICD10 main diagnoses in UK Biobank, with the exclusion of Coronary Artery Disease (ICD10 code: I25), which was explored in detail in the previous sections. The data were released by Neale Lab (1st August 2018, http://www.nealelab.is/uk-biobank/) and downloaded using a Python implementation of the MR Base API57. The variants in and around the prioritized drug target genes, allowing for a boundary region of 50 kbp, were extracted; palindromic variants were inferred using the API default MAF threshold of 0.3 and removed58. The Ensembl REST Client was used to gather positional information for the variants59.
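The handling of palindromic variants can be sketched as follows. A variant is palindromic when its two alleles are complementary (A/T or C/G), so the strand cannot be read from the alleles alone. The sketch assumes that such a variant is treated as ambiguous when its minor allele frequency exceeds the threshold of 0.3; this is an interpretation of the rule, not the API's code, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(allele1: str, allele2: str) -> bool:
    """True for A/T and C/G variants, whose alleles read the same on both strands."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def is_strand_ambiguous(allele1: str, allele2: str, effect_allele_freq: float,
                        maf_threshold: float = 0.3) -> bool:
    """Assumption: a palindromic variant cannot be strand-resolved from
    allele frequency when its MAF is above `maf_threshold`."""
    maf = min(effect_allele_freq, 1.0 - effect_allele_freq)
    return is_palindromic(allele1, allele2) and maf > maf_threshold


print(is_strand_ambiguous("C", "G", 0.45))  # True
print(is_strand_ambiguous("C", "G", 0.10))  # False
print(is_strand_ambiguous("A", "G", 0.45))  # False
```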

Power was further maximized by sourcing data from 23 publicly available GWAS with the largest sample sizes for such phenotypes (Supplementary Table 3 and Supplementary Data 5). All the GWAS clinical endpoints and UK Biobank ICD10 main diagnoses were grouped according to ICD10 chapters.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.