Background

Identifying causal loci and genes from human genetic data is integral to elucidating novel disease insights and therapeutic approaches. Quantitative hematopoietic traits are well studied, although relatively few causal variants and genes have been elucidated [1, 2]. Mendelian randomization studies have established causal relationships between hematopoietic traits and cardiovascular, autoimmune and neuropsychiatric disease [2], but causal genes and loci remain elusive.

Genetic colocalization analysis permits identification of shared regulatory loci, with advances extending the scope of potential studies from two to over 10 traits undergoing simultaneous analysis [3,4,5]. Recently, a colocalization algorithm was used to identify known and novel loci related to cardiovascular traits [5]. Key assumptions of this algorithm include i) consistent linkage disequilibrium patterns across studies (i.e., that studies were conducted on the same population), ii) there being at most one causal variant per genomic region per trait, and iii) that causal variants are directly identified or imputed in all datasets [5]. We reasoned that a similar analytical pipeline could help explain variants and genes underlying hematopoietic and other disease phenotypes. In this way, aggregated summary statistics might be used to specifically target loci with pleiotropic effects on multiple traits, enacted through one or a handful of genes.

Developmental cell types during hematopoiesis, the process that gives rise to all blood lineages, are relatively well mapped. We hypothesized that shared genetic signal impacting traits from multiple blood lineages might nominate genomic loci related to the stem and progenitor cells that spawned those types of blood cells. This approach is orthogonal to prior data that analyzed patterns in accessible chromatin to define genomic locations affecting multiple blood lineages [1]. For example, a shared single nucleotide polymorphism (SNP) related to quantitative variation in platelet, red blood cell (RBC), and white blood cell (WBC) counts might indicate a site or mechanism that is active in hematopoietic stem and progenitor cells (HSCs). SNPs related to platelet and RBC counts, but not WBC count, might reveal loci and related genes for megakaryocyte-erythroid progenitor (MEP) cells. We hypothesized that the directionality of such relationships might help elucidate lineage decisions during hematopoiesis, and help target loci and genes related to developmental hematopoiesis.

Blood traits are related to a number of human disease phenotypes [2]. Blood cells can cause disease (e.g., autoimmune traits) or be affected by therapies (e.g., anemia secondary to chemotherapy). For this reason, understanding pleiotropic associations between blood and other traits could reveal translationally relevant trait relationships or help predict off-target effects of gene-modifying therapies.

Here, we used genetic colocalization to define sites wherein two or more human traits shared genetic signal at genome-wide significant loci. We initially examined blood traits, and later expanded our analysis to include a total of 70 blood, autoimmune, cardiovascular, cancer, and neuropsychiatric traits. We then looked for quantitative trait loci impacting gene expression (eQTL) or exon splicing variation (sQTL) at or near sites of genetic colocalization. Our results identify sites that affect specific cell types during hematopoietic development, and reveal genetic variants underlying trait relationships between blood parameters and disease end-points.

Methods

SNP and study selection

GWAS summary statistics were obtained from publicly available repositories (Additional file 1: Table S1 [2, 6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]). We narrowed analysis to just those GWAS summary statistics for European populations with > 1 × 106 sites (i.e., those that were genome-wide). Analyzed SNPs were identified as genome-wide significant in the largest hematopoietic trait GWAS to date [2] or from a repository of genome-wide significant SNPs from a compilation of GWAS from the NHGRI-EBI Catalog (downloaded January 2019) [22]. In addition, we analyzed quantitative trait locus data from GTEx V7 [23]. Human genome version hg19 was used for all analyses.

Colocalization analyses

We used the HyPrColoc software to conduct colocalization experiments [5]. This software requires effect (e.g., beta or odds ratio) and standard error values for each analyzed SNP. We chose to analyze based on chromosome and position, given that multiple rsIDs might overlap at a given locus and be inconsistent between different GWAS. Although this removed duplicate rsIDs and may have caused some bias, we reasoned that this would be a minority of sites. This strategy optimized the number of individual positions that we were able to incorporate into our input dataset for colocalization analysis. We specifically looked at 500 kb regions (250 kb on either side of each site), in line with prior colocalization literature [5].

As the SNPs considered as input data varied between analyses, we presented separate results from analysis of 34 hematologic traits, and a composite of 70 traits. GWAS summary statistics were harmonized prior to analyses (https://github.com/hakyimlab/summary-gwas-imputation/wiki). There were 29,239,447 genomic sites analyzed for colocalization among the hematologic traits. A total of 1,667,428 harmonized sites were analyzed from GWAS summary statistics for the 70 traits. The decreased number of sites included in this latter analysis resulted in decreased power to detect associations. This was reflected in the maximum number of traits colocalized, which in Fig. 1a was 25 (out of 34 traits) versus 24 traits in Fig. 2a (out of 70 traits). The number of sites used for ‘restricted’ analysis of traits with limited genetic correlation (rg < 0.8) were similar to ‘full’ analyses, including 29,261,510 genomic sites for 17 blood traits (Additional file 2: Figure S1), and 1,667,741 genomic sites for 45 traits (Additional file 2: Figure S2).

Fig. 1
figure 1

Genetic colocalization between blood traits reflects hematopoietic lineage relationships. a Number of traits identified at each colocalization site (max = 25). b Heat map depicting percent overlap at colocalization sites between each hematopoietic trait pair. In each box, the number of sites where the row-specified trait and column-specified trait colocalized was normalized to the total number of colocalization sites for the ‘row trait’. For this reason, the heat map is asymmetric. Color scale represents the proportion of loci where each pair of traits colocalized. To the left of the heat map, hierarchical clustering accurately segregated red cell, platelet, and white cell traits in general agreement with blood lineage relationships. c Degree of colocalization (% overlap) generally reflects genetic correlation between trait pairs. Shaded area depicts the 95% prediction interval, with gray line at mean. Colored spots highlight trait pairs outside the 95% prediction interval that included two platelet traits (purple) or two red blood cell traits (red). Exemplary trait pairs are circled. Eo % gran, percentage of granulocytes that are eosinophils. Neut % gran, percentage of granulocytes that are neutrophils. Plt, platelet count. Mpv, mean platelet volume. Mchc, mean corpuscular hemoglobin content. Mcv, mean red cell corpuscular volume. Neut, neutrophil count. Neut+eo, total neutrophil plus eosinophil count

Fig. 2
figure 2

Genetic colocalization reveals shared regulatory loci and implicates causal genes underlying genetic associations between hematopoietic traits and disease end-points. a Number of traits identified at each colocalization site (max = 24). b Heat map depicting percent overlap at colocalization sites between each trait pair. In each box, the number of sites where the row-specified trait and column-specified trait colocalized was normalized to the total number of colocalization sites for the ‘row trait’. For this reason, the heat map is asymmetric. c Hierarchical clustering based on colocalization results associates related traits, which are color coded according to the key in part b. d Degree of colocalization (% overlap) reflects genetic correlation between trait pairs. Shaded area depicts the 95% prediction interval, with gray line at mean. Exemplary trait pairs are circled. Depsx, depressive symptoms. Rbc, red blood cell count. Baso, basophil cell count. Brca, breast cancer. Scz, schizophrenia. Eo%, eosinophil percentage of white blood cells (‘eo_p’) or granulocytes (‘eo_p_gran’)

After colocalization analysis, we narrowed our focus on only those loci with posterior probability for colocalization (PPFC) > 0.7, based on empiric simulations results from the creators of this algorithm showing that this conservatively gave a false discovery rate < 5% [5]. We noted that a more relaxed PPFC (e.g., > 0.5) yielded substantially more loci. A less conservative threshold could in this way be used as a hypothesis-generating experiment for cellular follow up studies.

Coding variant identification

We used the Ensembl Variant Effect Predictor (http://grch37.ensembl.org/Homo_sapiens/Tools/VEP) to identify coding variants and related gene consequences.

Linkage disequilibrium and quantitative trait locus (QTL) analyses

We wanted to assess comprehensively the potential gene expression or splicing changes related to colocalization sites. Thus, we analyzed each colocalization site together with all sites in high linkage disequilibrium (EUR r2 > 0.90, PLINK version 1.9).

We used closestBed (https://bedtools.readthedocs.io) to identify the nearest gene to each SNP. Genes and positions were defined by BioMart (https://grch37.ensembl.org/biomart/martview).

For each group of linked SNPs around a colocalization locus, we identified all eQTLs (GTEx V7 [23]), as well as all sQTLs as defined by two different algorithms (GTEx V3 sQTLseekeR [24], Altrans [25]). In the manuscript and Additional file 1, the quantity of QTL SNPs and pathway analyses reflect a composite of all genes impacted by a given locus, or by highly linked SNPs. Note that a given colocalization site might be linked with several SNPs, and that these SNPs might be proximal to and/or impact different genes. Affected genes shown are those with a unique Ensembl gene identifier (ENSG). In some cases, gene names may differ between Nearest Gene, eQTL and sQTL columns given that the underlying analyses were derived from different catalogues.

Gene ontology analysis

We submitted QTLs associated with specific traits for biological pathway assessment using the Gene Ontology (GO) resource (http://geneontology.org). Statistical significance of GO Biological Process enrichments were assessed using binomial tests and Bonferroni correction for multiple testing. Presented data were those pathways with p < 0.05.

Empirical distribution for expected colocalization counts

We used LDSC to estimate genetic correlation between traits (v1.0.1) [26]. Presented genetic correlation data reflect rg values obtained from LDSC analysis.

Mendelian randomization

We created genetic instrumental variables from GWAS summary statistics for blood traits [2], eczema [15], and depressive symptoms [20]. To generate instrumental variables, we first identified SNPs common to both exposure and outcome data sets. Using Two-sample MR (v0.5.4 [27]) and R (v3.6.3), we then clumped all genome-wide significant SNPs to identify single nucleotide polymorphisms within independent linkage disequilibrium blocks (EUR r2 < 0.01) in 10,000 kb regions.

We used mRnd (http://cnsgenomics.com/shiny/mRnd, [28]) to estimate the F-statistics of our instrumental variables. We calculated the proportion of genetic inheritance explained per Shim et. al. [29]. None of our instrumental variables was subject to weak instrument bias, as each had an F-statistic > 10 [28].

Data presentation

Data were created and presented using R, Adobe Illustrator CS6 and GraphPad Prism 8.

Statistics

Statistical analyses were conducted using R and GraphPad Prism 8.

Results

Genetic colocalization recapitulates hematopoietic lineage relationships

Our first aim was to validate whether colocalization could effectively capture known trait relationships and genetic correlations between hematopoietic lineages [1]. We performed colocalization analysis [5] using genome-wide association study (GWAS) summary statistics related to 34 quantitative hematopoietic traits for 2706 genome-wide significant loci [2], revealing a total of 1779 sites wherein 2 or more traits colocalized with a PPFC > 0.7 (Additional file 1: Table S2). In simulations, these criteria identified the causal variant, or a variant in high LD with the causal variant, with a false discovery rate < 5% [5]. Colocalization sites specified 3.6 ± 2.3 traits (mean ± SD), with 22% of the loci (259 in total) representing highly pleiotropic sites where 6 or more traits colocalized (Fig. 1a). Hence, a substantial proportion of interrogated loci (66%) impacted multiple hematopoietic traits.

To investigate trait relationships, we constructed a heat map to depict the percentage colocalization between trait pairs (Fig. 1b). Hierarchical clustering of colocalization results reflected blood lineage relationships, with platelet, erythroid, and white blood cell traits generally clustering as expected.

We then asked whether our colocalization findings mirrored genetic correlation between hematopoietic traits [26]. Indeed, more closely related traits colocalized more often (Fig. 1c, r2 = 0.91 by quadratic regression with least squares fit). Directly correlated (e.g., ‘neutrophil count’ and ‘neutrophil + eosinophil count’; ‘granulocyte count’ and ‘myeloid white blood cell count’), and inversely correlated trait pairs (e.g., ‘eosinophil percent of granulocytes’ and ‘neutrophil percent of granulocytes’; ‘lymphocyte percent’ and ‘neutrophil percent’), essentially always colocalized.

Several trait pairs fell outside the 95% prediction interval. The majority of these trait pairs included two traits from the same hematopoietic lineage (e.g., ‘mean platelet volume’ and ‘platelet count’; ‘mean corpuscular hemoglobin concentration’ and ‘mean red cell volume’) (Fig. 1c). Lineage-critical loci or genes might be expected to have more significant influence on these trait pairs than would be captured by genetic correlation measurement.

In sum, these results validated the notion that colocalization analysis results would mirror genetic correlation, and reflect known relationships among hematopoietic lineages and traits. Interestingly, trait pairs without genetic correlation frequently had some degree of colocalization (Fig. 1c, y-intercept = 0.077 ± 0.123). This likely reflects horizontal pleiotropy, in which a given locus and related gene(s) impact traits that are not biologically related. In the context of hematopoietic development, our derived estimate of chance colocalization between unrelated traits is therefore ~ 8%.

Given high genetic concordance between some blood traits, we also performed colocalization analysis after removing traits with genetic correlation (rg) > 0.8. This experiment, using 17 quantitative hematopoietic traits, identified 946 colocalization sites for 2 or more traits with a PPFC > 0.7, representing 35% of interrogated loci (Additional file 1: Table S3 and Additional file 2: Figure S1). Compared with our analysis of 34 blood traits, the number of traits that colocalized at each locus was reduced as expected (2.6 ± 1.3, mean ± SD). Importantly, both analyses identified similar sites and trait relationships. Below, we focus on findings from more comprehensive colocalization experiments using 34 traits.

A genetic colocalization strategy to identify loci related to hematopoietic development

We leveraged our colocalization results to identify quantitative trait loci (QTLs) related to specific hematopoietic lineages and cell types. For example, loci where white blood cell (WBC), red blood cell (RBC), and platelet counts colocalize might indicate developmental perturbation in hematopoietic stem and progenitor cells (HSCs). We therefore looked for sites of colocalization between these quantitative blood traits, and identified overlapping genome-wide significant QTLs. Indeed, QTLs related to these loci pointed to known HSC regulatory genes SH2B3 [30, 31], ATM [32], and HBS1L-MYB [33] (Additional file 1: Table S4).

We also parsed loci identified by colocalization to specifically affect platelet or red cell traits, with the hypothesis that these loci would relate to terminally differentiated blood cell biology. There were 439 sites nominated by colocalization analysis specifically for red cell traits (RBC, HCT, MCV, MCH, MCHC, RDW) but not platelet traits or WBC count. These sites, or highly linked loci, influenced expression of 614 genes (123 genes in whole blood, Additional file 1: Table S5). Among genes regulated in whole blood were RHD [34], HBZ [35], and LPL [36], which can influence erythroid stability and/or lifespan, as well as SP1 [37], ESR2 [38], and FANCA [39], which impact erythropoiesis. Gene ontology (GO) analysis [40] of these gene sets revealed significant enrichment of genes related to cellular metabolic processes (Additional file 1: Table S6). A similar analysis of platelet trait-restricted sites (PLT, PCT, MPV, PDW), including highly linked loci, identified 270 sites impacting expression of 399 genes (77 genes in whole blood, Additional file 1: Table S7). These genes included STIM1 [41] and C4BPA [42], which impact platelet reactivity and/or thrombosis risk, as well as MASTL [43] and TPM4 [44], which influence megakaryo-thrombopoiesis. Pathway analysis of these genes revealed enrichment of apoptotic cell clearance and metabolic processes (Additional file 1: Table S8). Complement-mediated apoptotic cell clearance mechanisms are indeed important for regulating platelet count [45].

To our surprise, pathways analyses of red cell and platelet lineage-restricted colocalization QTLs were not enriched for processes ascribed to hematopoiesis, erythropoiesis, or megakaryopoiesis. This suggests that genes and processes linked to terminal red cell and platelet traits are largely impacted by cellular function and reactivity, rather than developmental perturbations. With notable exceptions whereby causal loci do impact hematopoietic development (e.g., [30, 46,47,48]), our findings suggest the many of the identified genes and factors may not impact hematopoiesis per se. In fact, our results indicate that blood cell-extrinsic properties (e.g., apoptotic cell clearance mechanisms) frequently impact quantitative hematopoietic traits. In sum, our findings reveal a multitude of known variants and genes, as well as novel QTLs and related genes that warrant further study.

Illuminating hematopoietic contributions and associations with disease phenotypes

We then applied an extended colocalization analysis to summary statistics for 70 total hematopoietic, cardiovascular, autoimmune, cancer, and neuropsychiatric traits (Additional file 1: Table S1 [2, 6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]). Variations in size and power across these studies would be expected to influence detection of trait associations and/or colocalizations. Following allele harmonization, colocalization analysis using 9852 genome-wide significant loci from the NHGRI-EBI database [22] and blood traits [2] revealed a total of 2123 sites (22%) wherein two or more traits colocalized with a PPFC > 0.7 (Additional file 1: Table S9). The average number of traits that colocalized at a given site was 3.3 ± 2.4 (mean ± SD), with 83 loci identified as a ‘very pleiotropic’ colocalization site for ≥9 traits (Fig. 2a). Known trait relationships were recapitulated among these colocalization sites (e.g. bipolar disorder and schizophrenia; Fig. 2b-d). These results again reflected genetic correlation between traits, estimating a small degree of pleiotropy (~ 4%) absent genetic correlation (Fig. 2d, r2 = 0.83, y-intercept = 0.037 ± 0.117).

Restricted analysis of 45 traits with genetic correlation (rg) < 0.8 identified 1670 colocalization sites, with 2.6 ± 1.3 (mean ± SD) colocalizing traits per locus (Additional file 1: Supplemental Table S10 and Additional file 2: Figure S2). This latter experiment identified similar loci and trait relationships as our 70-trait analysis. In the interest of providing as comprehensive a list of results as possible, the findings discussed below are derived from joint analysis of 70 traits.

Mendelian randomization analyses, which use genetic variants to estimate causal effects of a genetically-determined exposure of interest on an outcome, have established relationships between blood traits and some disease phenotypes [2]. Despite holding significant therapeutic potential, most causal loci and genes underlying these associations are unknown. Our results reveal putatively causal loci, related genes, and molecular pathways related to these trait pairs (Additional file 1: Tables S11-S17). For example, whole blood QTLs related to genes known to affect asthma pathogenesis or severity (e.g. IL18R [49,50,51], ZFP57 [52], BTN3A2 [53], NDFIP1 [54], SMAD3 [55], CLEC16A [56], and TSLP [57]) were associated with colocalization sites for asthma and neutrophil, eosinophil, monocyte and/or lymphocyte traits (Additional file 1: Tables S11-S14). Similarly, QTLs for genes linked to coronary artery disease risk (e.g. LPL [58], SREBF1 [59], GIT1 [60, 61], SKIV2L [61], MAP3K11/MLK3 [62]) were associated with colocalization sites linking coronary artery disease with mean platelet volume, lymphocyte, and/or reticulocyte counts (Additional file 1: Tables S15-S17). Other identified genes associated with colocalization loci represent novel findings that could enhance understanding of the pathophysiology and/or treatment of these diseases, although functional validation remains necessary.

Our findings also revealed novel trait associations. For example, eosinophil percentage and eczema colocalized more often than predicted based on their genetic correlation (Fig. 2d). These traits are clinically related [63] and colocalized at 13 loci (Additional file 1: Table S18), including sites near genes that regulate eosinophil biology (ETS1 [64, 65] and ID2 [65]) and autoimmune disease (KIAA1109 [66,67,68] and TAGAP [69]). These colocalization sites also indicated potential regulation of unexpected genes that warrant validation (SNX32, ZNF652, KLC2).

We reasoned that Mendelian randomization analysis might provide additional support for this trait relationship. However, we did not necessarily expect significant genome-wide association, given that our colocalization analysis highlighted a fairly restricted subset of loci. By Mendelian randomization, we identified a 27% increased risk of eczema for each 1 standard deviation increase in eosinophil percentage by inverse variance weighted method (95% confidence interval = 9–48%, P = 0.002), although the association did not reach statistical significance for weighted median or MR-Egger methods (Additional file 1: Tables S19-S20). This analysis did not show evidence of horizontal pleiotropy (MR-Egger intercept P = 0.87) and the instrumental variable was not subject to weak instrument bias (F-statistic = 132, Additional file 1: Table S19) [28]. Although these findings would not constitute strong independent evidence of causality alone, they did lend some additional support to the relationship identified through colocalization analysis.

We also identified 5 colocalization sites for red blood cell count, basophil count, and depressive symptoms, which exceeded expectations based on genetic correlation (Fig. 2d and Additional file 1: Table S21). These colocalization sites included eQTLs for YPEL3, which is highly expressed in whole blood [23] and affects neural development [70], as well as PRSS16, which impacts immunologic development [71] and has been implicated in multiple GWAS for depression phenotypes [72]. While blood phenotypes may impact depressive symptoms, it is also possible that these eQTLs and genes have separate functions in hematopoietic and brain tissues. Mendelian randomization experiments did not identify statistically significant causal relationships for red blood cell count or basophil count on depressive symptoms (Additional file 1: Table S19 and Additional file 1: Tables S22-S23), consistent with low genome-wide correlation (Fig. 2d). Future GWAS for depressive symptoms with increased size and power may better elucidate causal relationships, if such relationships exist.

In addition, we identified trait relationships beyond hematologic parameters, including 4 colocalization sites for breast cancer and schizophrenia (Fig. 2d and Additional file 1: Supplemental Table S24). Recent epidemiologic [73] and genetic [74] studies have linked schizophrenia and breast cancer risks. Our results nominate TCF7L2 [75, 76], BCAR1 [77], and NEK10 [78, 79] as potential targets to help explain this association.

Colocalization at coding variation sites identifies clinically relevant trait associations

We reasoned that colocalizing sites could help explain unexpected or pleiotropic effects of gene perturbations. Here, we focused on missense variation in coding regions to establish direct locus-gene relationships. This approach identified clinically relevant cross-trait associations.

Variation in rs2476601 causes a missense mutation in PTPN22 (Cys1858Thr). This site has been linked to autoimmunity and Crohn’s disease phenotypes, but not ulcerative colitis [80]. Immune response dysregulation, including WBC biology, contributes to the Crohn’s phenotype [80]. We identified shared genetic signal for increased Crohn’s disease risk and decreased WBC count, but not ulcerative colitis, at this location (Additional file 1: Table S25). This finding supports a specific clinical association with Crohn’s disease for the PTPN22 Cys1858Thr mutation.

Mean platelet volume (MPV) variation has previously been linked to altered risk of coronary artery disease, but understanding of genes underlying this association is lacking [2]. We identified colocalizing signals for increased coronary artery disease risk and increased MPV in a missense coding mutation for ZC3HC1/NIPA (Additional file 1: Table S15). This variant causes an Arg > His missense change in several NIPA isoforms. NIPA impacts heart disease risk and cell cycle regulation [81]. Further studies are needed to understand how this gene might coordinately impact platelet biology and coronary artery disease risk, as well as other traits linked to this locus.

Altered lipid and cholesterol levels have been clinically observed in patients with hereditary hemochromatosis due to mutations in High FE2+ (‘high iron’, HFE) [82]. Patients with hemochromatosis have lower cholesterol levels than normal, although an open question is whether this observation is due to manifestations of disease or HFE deficiency itself. Our data show that individuals heterozygous for the Cys282Tyr allele have lower reticulocyte count and higher total cholesterol and low density lipoprotein levels (Additional file 1: Table S26). This suggests that HFE haploinsufficiency increases cholesterol and lipid levels, and that decreased cholesterol in patients with hemochromatosis occurs secondary to myriad tissue manifestations of clinically significant hemochromatosis or iron overload [83].

Finally, we hypothesized that our analysis might also help predict off-target effects of novel therapeutic agents. For example, tumor necrosis factor (TNF)-related apoptosis inducing ligand (TRAIL) is a promising novel chemotherapeutic target [84]. A mutation in the TRAIL 3′ UTR was recently associated with decreased triglyceride levels [85]. Targeted analysis of this site identified colocalizing signals for altered myeloid and platelet indices (Additional file 1: Table S27). It will be interesting to see whether these traits are affected in upcoming clinical trials targeting TRAIL.

Discussion

Genetic colocalization approaches have proven a powerful tool in revealing pleiotropic effects of certain loci on multiple traits [3, 4]. Here, we have adapted the colocalization methodology to reveal sites and genes related to specific cell stages in hematopoietic development, and identify relevant trait relationships between blood traits and human disease end-points. We present what we believe to be a minimal estimate of these associations, given the assumption of at most one causal locus per genomic region and our conservative threshold for colocalization (PPFC > 0.7). This threshold revealed high-confidence targets, although future gene discovery studies might instead use a more relaxed threshold (e.g., PPFC > 0.5) to enable a more encompassing set of loci.

GWAS have linked thousands of genomic sites with blood trait variation [2]. The biology related to each site could relate to developmental hematopoiesis, as has been shown for CCND3 [46], CCNA2 [47], SH2B3 [30], and RBM38 [48]. Alternatively, biology related to GWAS sites might impact terminally differentiated cell reactivity or turnover. For example, altered platelet reactivity can affect quantitative platelet traits [86, 87]. Cellular validation experiments might be streamlined if one could better parse relevant sites, genes and developmental stages based on GWAS information. Gene targets presented herein represent one approach to such a computational pipeline, and are orthogonal to previously published findings based on accessible chromatin patterns during hematopoietic development [1]. Future studies combining these computational modalities might be useful for those interested in evaluating specific genes or loci in blood progenitor biology.

Our expanded analysis of 70 human traits recapitulated known trait relationships between blood traits and human disease phenotypes, and identified sites with potential translational relevance. Variations in GWAS size and power may have limited our ability to identify certain trait associations. We anticipate that increasingly well-powered GWAS will likely to expand the catalogue of colocalizations in the future. Larger studies may also reveal new causal genetic associations in Mendelian randomization analyses, although trait relationships need not meet genome-wide significance to be biologically important. In fact, each colocalization site identified in our analysis could be viewed as a hypothesis-generating site for future cellular validation. Understanding trait relationships through colocalization analysis may also be useful for multivariable Mendelian randomization and/or mediation analyses designed to reveal causal biological mechanisms.

Understanding how missense coding mutations impact phenotypes offers the most direct relationship between genes and traits. An adaptation of our colocalization strategy might be employed to predict off-target effects of gene modulation, help understand the cellular basis of disease, or investigate unexpected cellular developmental relationships (e.g., sites related to multiple mesoderm-derived tissues might triangulate to early mesodermal biology). We anticipate an expanded array of such targets could be revealed with larger, trans-ethnic GWAS.

Conclusion

In an extensive genetic colocalization analysis, we have identified loci, genes and related pathways related to hematopoietic development. Further, our colocalization results identified loci relating 70 hematopoietic, cardiovascular, autoimmune, neuropsychiatric and cancer phenotypes. This repository of associations will be useful for mechanistic studies aimed at understanding biological links between phenotypes, for developing novel therapeutic strategies, and for predicting off-target effects of small molecule and gene therapies.