Background

The annotation of the human genome by ENCODE, Roadmap Epigenome, FANTOM and other consortia [16] has revealed the widespread distribution of transcription factor binding loci throughout the genome. These patterns, known as cistromes, include numerous enhancers that are often extremely distal to genes [713]. The diversity and function of these distal enhancer sites suggests a hitherto unsuspected level of complexity to transcriptional control (reviewed in [14]). Recently, it has emerged that transcription factors including nuclear hormone receptors can bind at enhancers in both direct contact with DNA (cis) and indirectly in contact with another protein that in turn bind to DNA (trans). Furthermore this type of trans binding is often absent of canonical motifs but associated with significant levels of transcription factor clustering [1517]. In parallel, large scale genome-wide association studies (GWAS) of genetic variation have revealed that the vast majority of SNPs are contained in areas of the genome that are outside of gene exons, and therefore do not have the potential to make a direct contribution to protein structure [18]. Taken together, these findings of genomic distribution of transcription factor binding sites and widespread genetic variation at non-coding sites raises the possibility that phenotype- and disease-associated SNPs at distal regions impact transcription factor binding that in turn is associated with disease [18, 19].

Testing the possibility that genetic variation impacts transcription factor function that underpins trait differences and disease phenotypes is analytically challenging, given the size of the datasets and the potential for false discovery. Various groups have addressed this challenge; notably, both the ENCODE and Roadmap Epigenome consortia leveraged the remarkable volume of ChIP-seq xdata they generated and merged the binding sites with GWAS data to reveal and rank sites where SNPs appear to have a significant impact on the activity of multiple transcription factors [4, 20, 21]. Whilst these analyses represented a highly comprehensive approach, undertaking ChIP-seq with several hundred DNA binding factors, this still represents a fraction of the approximately 3000 different transcription regulatory proteins in humans.

Rather than taking a SNP-centric approach [22], we have approached this problem by focusing on a specific transcription factor (one that is outside of those analyzed by the ENCODE and Roadmap Epigenome consortia). This is potentially a complementary and attractive approach, as it overlays with the work flow of wet-lab based biologists who generally approach biological questions through the lens of an in-depth understanding of a single transcription factor, or transcription factor family. Specifically, we hypothesize that the biologically significant relationships between transcription factor genomic interactions and phenotype-associated biology can be discovered by integrating the known binding sites and all trait- and disease-associated common SNPs. This exploits the value of both datasets to identify the biologically significant intersection of transcription factor binding and genetic variation.

In the current study we have addressed the challenge of dissecting the multiple actions transcription factor function by examining nuclear hormone receptor actions. The nuclear hormone receptor superfamily acts as an integrated genomic sensor of dietary, environmental and hormonal signals. These receptors represent some of the most successful examples of targeted therapies [2326], and in human disease they represent the target for approximately 15% of all pharmacologic drugs [27]. Furthermore members of this superfamily are expressed in virtually all cell types and functionally are highly integrated both with each other [2830] and with the actions of other signaling pathways [31].

As a model transcription factor we selected the vitamin D receptor (VDR/NR1I1) [32] from this superfamily, and investigated the association between genetic variation at VDR binding sites and disease susceptibility. The VDR is an attractive transcription factor with which to dissect pleotropic functions because its actions have been identified or implicated in many physiological processes. VDR actions include classical endocrine functions to regulate serum calcium levels, and are also related to the control of cell proliferation and differentiation [33], anti-bacterial functions [3438] and immuno-modulatory functions [39, 40]. As a result, roles for VDR function and dysfunction have been implicated in a wide range of complex phenotypes including autoimmunity, diabetes, cardiac health and cancer [23, 33]. Reflecting the potential importance of this receptor in public health, there are a number of ongoing large-scale prospective studies that aim to address whether supplementation of serum vitamin D levels can have a significant health impact [4143].

Attempts to explain this have focused heavily on how the VDR impacts gene regulation. At the level of candidate target genes, the VDR has been demonstrated to functionally interact with a wide range of other transcription factors, including SP1 [44, 45] GATA4 [46], HNF4α [47], CTCF [48], FOXO4 [49], STAT3 [50, 51], and NF-κB [52, 53]. This latter interaction with NF-κB is also supported by a number of studies that examined the interactions between VDR signaling, and the control of inflammatory phenotypes, and specifically the cross-talk with NF-κB actions. These interactions are relatively well described in intestinal systems and include direct antagonism between VDR and NF-κB over the controls of shared target genes [5460].

Efforts to understand the DNA sequences bound by the VDR have built on findings from other nuclear hormone receptors [6163], and biochemical approaches on candidate target gene promoter regions. These approaches identified a dual hexameric DNA motif spaced by 3 bp, a so-called DR3 motif [64, 65], is bound with high affinity by the VDR. However, other potential motifs have also been identified [66], and the application of ChIP-seq approaches to nuclear hormone receptors has revealed binding site diversity and the importance of flanking regions for cofactors to be biologically important to determine function [67]. For example, VDR ChIP-seq studies have been performed in different human cell types [811, 68], in the presence and/or absence of ligand, and revealed the impact of ligand binding on VDR genomic targeting.

Each analysis revealed approximately 2,000 to 6,000 genomic loci normally distributed around transcription start sites (TSS), reflecting the binomial distributions found for other transcription factors [21, 68]. Another important finding from these studies is that DR3 motif appears to be the minority genomic element that directly binds the VDR, and perhaps only 20% of genomic VDR binding sites contain a DR3 motif. This low number may in part reflect algorithm limitations for de novo motif discovery. Although this number is increased following ligand treatment (reviewed in [23]), this nonetheless suggests that the classical DR3 motif is the minority motif bound by the VDR. This also suggests that there is considerable diversity in the DNA sequences bound by VDR complexes, and that VDR participates in both cis and trans genomic interactions.

The apparently interactive nature of the VDR with other transcription factors and the diversity in observed binding sites for the VDR were catalysts for the current study. Given the multiple actions that the VDR is implicated in and the heterogeneity of VDR genomic binding sites we further hypothesized that genetic variation may be exploited to identify critical VDR genomic interactions.

Therefore the present study aimed to integrate VDR genomic binding data (ChIP-seq) with GWAS SNPs to provide novel insight into the interaction between disease/phenotype associated SNPs and VDR binding. Furthermore, as it is clear that a SNP in linkage disequilibrium with a GWAS SNP may be contained within a VDR binding region and we also included these SNPs in the total list of SNPs examined (Fig. 1 Top). From these analyses, we applied transcription factor motif searching and exploited other ChIP-seq data to identify significant interactions between the VDR and other transcription factors, notably NF-κB. Taken together, we provide a statistically robust approach and strong analytical framework centered around transcription factor binding and the impact of genetic variation to infer functionally significant phenotypic consequences of genetic variation on transcription factor function.

Fig. 1
figure 1

Schematic workflow and key findings of the study. (Top panel) shows the SNPs and ChIP-seq datasets. Examples of possible overlap of Lead SNPs and SNPs in LD with Lead SNPs at genomic VDR loci are shown with examples of LD blocks. (Bottom panel) shows the number of SNPs identified from individual data sets (GWAS catalog, HapMap3 and 1000 genome project). It also shows the number of SNPs under VDR peaks and SNPs in a DR3-type motif

Results

VDR binding overlaps with Lead and LD SNPs associated with immune phenotypes

The analyses of the individual data sets from five VDR ChIP-seq studies [811, 68] (Additional file 1: Table S1) yielded a VDR consensus cistrome consisting of 23,409 non-overlapping genomic binding sites [68]. Lead SNPs (10,432) were those downloaded directly from the GWAS catalog, and LD SNPs selection resulted in 181,050 SNPs, respectively, for a total of 191,482 SNPs associated with 930 unique phenotypes (as defined in the GWAS catalog) from 1,566 studies (Fig. 1 Bottom and Additional file 1: Table S2). The overlap of genomic coordinates of these 191,482 SNPs with the 23,409 VDR loci identified a total of 574 disease- or phenotype- associated SNPs representing 211 unique traits under 506 unique VDR loci (Fig. 1 Bottom and Additional file 1: Table S3). Hypergeometric analyses corrected for multiple testing yielded 42 unique SNPs associated with seven different phenotypes. That is, the 42 SNPs that are significantly associated, at the genome-wide level, with a disease- or phenotype-associated trait, are also significantly overrepresented at VDR binding loci in individual and/or the combined dataset (Table 1).

Table 1 List of trait SNPs showing significant enrichment for associated traits

To attempt to gauge if a VDR-SNP-trait relationship was tissue specific we considered how common the relationships were across the different ChIP-seq data sets. The seven traits significantly associated with the 42 disease- or phenotype-associated SNPs were identified in the lymphoblastoid cells and also in one other data set, except for leprosy, which showed significant enrichment in only the VDR ligand-stimulated colorectal adenocarcinoma LS180 cells. Esophageal cancer (squamous cell), showed significant enrichment in more than one VDR dataset (p = 0.0001), and it was also amongst the significant traits within the ligand-stimulated ChIP-seq dataset in the CEPH cell line. Five of the other seven traits identified that associated with VDR peaks had predominant immune phenotypes including Helicobacter pylori serologic status, self-reported allergy and Celiac disease (Table 1 and Additional file 1: Table S4).

Phenotype associated SNPs identify regions where the VDR binding is coincident with other transcription factors

Previously, we mined the consensus VDR cistrome for canonical DR3 motif using the de novo motif prediction tool, HOMER [69]. Searching under a low stringency motif score setting 38% of these sites (8,897) contain canonical DR3 motifs [68] and 2.6% (n = 15) of the 574 disease- or phenotype-associated SNPs were under a VDR peak that contained a DR3 motif (Table 2). Of these 15 SNPs, rs10174949 (Lead) and rs16829980 (r 2 = 1 with another Lead SNP, rs59374417), were contained within the 42 SNPs reported above that survived multiple test correction.

Table 2 SNPs present in DR3-type motifs contained within VDR peaks and associated with traits

Given that 40 of the 42 significantly associated GWAS SNPs were not linked to a DR3 motif, we mined these regions for significant enrichment of other transcription factor associations. In the first instance we mined these regions for transcription factor binding using RegulomeDB [20]. Analysis of these 42 SNPs with RegulomeDB revealed that 39/42 SNPs were predicted to lie in a region bound co-incident with other transcription factors (Additional file 1: Table S5). Filtering these interactions to only consider SNPs that had a high RegulomeDB score revealed phenotype-associated SNPs that significantly impacted transcription factor binding (Table 3). Six had scores from 1b-1f (likely to affect binding and linked to expression of a gene target e-QTLs) and nine had scores from 2a-2b (likely to affect binding) (Table 3 and Additional file 1: Table S5). It is also interesting to note the breakdown of all NHGRI SNPs studied (Table 3). 7324/191482 SNPs (3.8%) are in transcription factor binding regions and are predicted to change binding significantly with a score of 1 or 2. This is increased to 37% when considering these SNPs under VDR peaks and remains at 35% when considering VDR peaks significantly associated with a trait. These findings further support that the concept that these SNPs are functionally relevant for the disruption of transcription by multimeric transcription factor complexes that contain the VDR.

Table 3 The distribution of Regulome DB scores for different SNP lists

The diversity of the transcription factors identified suggest a high level of VDR-protein interactions that may represent VDR binding in an indirect or trans relationship with gene regulation. That is, the VDR is recruited to another protein complex, which is in turn in direct contact with DNA. Focusing on the 15 trait-associated SNPs that also had RegulomeDB scores of ≤2 suggest common VDR interactions with 95 different transcription factors that were co-enriched in binding sites for the VDR. For example, Fig. 2 illustrates the TLR1 gene locus, and in particular the location of the rs5743565 SNP that is in LD with Lead SNP rs2101521, reported to be significantly associated with self-reported allergy [70]. This particular region of the gene also illustrates that it is commonly found to be associated with the open chromatin histone mark H3K27ac and is coincident with the significant binding of more than 30 different transcription factors as revealed by ENCODE investigators and reported in RegulomeDB.

Fig. 2
figure 2

Function of a SNP contained within a VDR binding peak associated with Toll-like receptor 1 (TLR1). The top panel shows the presence of rs5743565 (blocked out), which is in LD with rs2101521 (orange arrow) associated with self-reported allergy in GWAS study. VDR peaks are shown as Red block custom tracts. The rs5743565 is associated with expression of the genes TLR10, TLR6 and TLR1. The bottom panel shows the ENCODE ChIP-seq data for binding of different transcription factors in the region harboring rs5743565 and VDR binding sites (in red). The presence of rs5743565 in the binding region is predicted to alter the association of multiple transcription factors and the expression of associated genes, and therefore been assigned a score in RegulomeDB of < = 2

The RegulomeDB approach is highly comprehensive approach and based on the actual binding of transcription factors as identified by ENCODE. We complemented this by mining the 42 SNPs with HaploReg [71] to identify the predicted impact of these SNPs on regulatory motifs. The predicted impact of the 42 disease- or phenotype-associated SNPs on the affinity of regulatory motifs is shown in Additional file 1: Table S6. Out of 42 SNPs analyzed, 37 SNPs showed potential to affect one or more motifs. The HaploReg analyses also showed that the 42 SNPs were highly enriched for strong enhancers and DNase hypersensitive regions across multiple cell lines (Additional file 1: Table S7).

Together the analyses of 42 disease- or phenotype-associated SNPs with RegulomeDB and HaploReg suggest that these SNPs affect binding motifs of important transcription factors. Also, they are present in enhancer and DNAse sites, suggestive of active transcription factor binding and they have a significant relationship with an altered gene regulatory capacity.

Co-enrichment of VDR and NF-κB binding in CEPH cell lines

RegulomeDB analysis [72] of the transcription factors associated with all 42 SNPs established the most significantly impacted transcription factor binding. NF-κB was the most frequently enriched transcription factor when considering either all 42 SNPs or focusing on the 15 SNPs with RegulomeDB scores of 1 and 2. We chose to focus on the transcription factor enrichment amongst the 15 SNPs with high RegulomeDB scores as we reasoned these are likely to be functionally relevant. To illustrate the frequency, we have adopted a similar approach to motif analyses in which the size of the nucleotide letter is proportional to significance. In the current study we have exploited a wordcloud approach to visualize the level of significance of the transcription factors. Reflecting the active levels of transcription from these regions the RNA polymerase POLR2A is very commonly represented. However this does not bind in a motif-specific manner, rather it is recruited through protein-protein interactions. Clearly, after POLR2A, NFKB1 is prominently featured (Fig. 3a) as indicated by the font size of the transcription factor.

Fig. 3
figure 3

The co-occurrence of VDR and NF-κB1 binding at sites of significant genetic variation. a The wordcloud characterizes the transcription factors binding at the location of significantly associated Lead SNPs (and not linked to a DR3 motif) were identified by RegulomeDB; 95 different transcription factors, representing eight unique subgroups, overlapped with, and their function was altered by, these SNPs. The font size of transcription factor related to the number of times it was identified to associate with a significant SNP site. b The intersection of VDR and NF-κB ChIP-Seq from CEPH cell lines. The VDR ChIP-seq in the unstimulated and ligand-stimulated states in the CEPH cell lines GM10855 and GM10861 were intersected to generate a consensus cistrome for VDR binding sites in the unstimulated and stimulated states. Similarly, a consensus NF-κB cistrome was generated by intersecting the ChIP-Seq from the cell lines GM12878, GM12891, GM12892, GM15510, GM18505, GM18526, GM18951, GM19099, GM19193. These consensus cistromes were then intersected to reveal the unique and shared binding sites, and the overlap with the identified SNPs. c The effect of rs10174949 genotype on NF-κB binding in ChIP-seq from HapMap cell lines. The altered allele of rs10174949 was predicted to decrease the strength of predicted affinity of NF-κB1 by HaploReg. The ENCODE ChIP-seq data sets of NF-κB for HapMap cell lines for which genotype data was available were downloaded into Integrative Genomics Viewer (IGV). Population ID and the genotype for each cell lines is shown. Cell lines with homozygous reference allele are shown in blue, heterozygous samples are shown in green and sample with homozygous altered allele is shown in red. Samples with the homozygous altered allele (AA) completely lost NF-κB binding compared to the homozygous (GG) or heterozygous (AG) reference alleles

These findings suggested that disease and trait associated SNPs were significantly associated with VDR and other transcription factor, most prominently with NF-κB. To test how the common the co-occurrence of the VDR and NF-κB binding sites we exploited CEPH cell lines for which there are relevant ChIP-seq data. These cells were chosen as they are not transformed and therefore transcription factor binding is not likely to be altered. Specifically, we examined the genome-wide overlap between basal and 1α,25(OH)2D3-stimulated VDR and TNFα-stimulated NF-κB. The Venn diagram illustrates the overlap between these three cistromic data sets (Fig. 3b). Notably, the intersection between 1α,25(OH)2D3 stimulated VDR and TNFα stimulated NF-κB cistromes is pronounced, with 5,635 genomic regions shared. Furthermore, these shared binding regions also contained 22 of the 42 disease- or phenotype-associated SNPs (Fig. 3b). These findings further suggest the potential biological importance of the interaction between these two transcription factors. Finally, all of the SNPs reported to affect gene expression (e-QTL) were associated with immune phenotypes (Table 4).

Table 4 Showing e-QTLS and associated phenotypes for SNPs under VDR peak significantly enriched for trait association

Figure 3c shows an example of how genetic variation may influence NF-κB binding. Specifically, rs10174949 on chromosome 2p lies in an intronic region of LINC00299 and is reported to be significantly associated with self-reported allergy [70]. HaploReg revealed that the LOD score was 12 in the presence of reference allele, whereas the altered allele reduced the LOD score to 0.3. These findings suggest the altered allele (A) results in a loss of predicted binding strength compared to the reference allele (G). These predictions were supported by analyses of rs10174949 alleles in NF-κB ChIP-Seq data available on nine different CEPH cell lines from HapMap. Four out of five HapMap cell lines with the wild-type allele (GG) have a pronounced NF-κB binding peak at this region (Fig. 3c, peaks in blue), two of the three heterozygote cell lines had intermediary binding (peaks in green) whereas the one cell line with the homozygote (AA) genotype showed little or no binding (peak in red).

Discussion

The goal of the current study was to test the hypothesis that the pleotropic actions of the VDR could be dissected by integrating VDR ChIP-Seq studies with GWAS findings in a robust statistical framework. The approach we describe is significant for several key reasons; the potential impact on the understanding of VDR biology, the generalizability of the approach to other nuclear hormone receptors and its potential for resonance with biological investigators.

Firstly, the impact of genetic variation on VDR function has been very intensively investigated almost exclusively in the context of SNPs either within the VDR gene itself or genes that encode proteins that in turn regulate VDR function (reviewed in [7376]). Whilst these investigations highlight a role for VDR in human health and disease, they are ultimately limited in terms of scope of how genetic variation is considered and perhaps as a result, few of the findings have been replicated at the genome-wide level. By contrast, the current study identifies genetic variation that is significant at the genome-wide level is enriched in VDR binding sites and impacts VDR function. This finding informs understanding of the pleotropic actions of the VDR [4460]. This has not been reported previously.

Secondly, the approach is applicable to other members of the nuclear hormone receptor superfamily, as well as other disease relevant transcription factors. Interestingly, despite their developmental and therapeutic relevance, nuclear hormone receptors are not comprehensively covered by either ENCODE or Roadmap Epigenome consortia. However, given the preeminent position of nuclear hormone receptors in human health and disease, other members of this superfamily have been extensively investigated by ChIP-seq approaches by other investigators. Indeed, there are hundreds of ChIP-seq datasets available for nuclear hormone receptors that have not been considered in the RegulomeDB approach that builds on ENCODE data. For example, the androgen receptor has been extensively investigated. This receptor is intimately associated with prostate cancer and other areas of men’s health [77, 78] and is associated with breast cancer in women [79, 80]. Multiple investigators have examined this receptor and to date there are over 130 individual ChIP-seq experiments for this receptor. Similarly, the estrogen receptor, which is intimately associated with breast cancer, bone health and a range of other phenotypes in women’s health [81, 82] has also been extensively investigated with over 200 ChIP-seq data sets available. Outside of the sex steroids, the glucocorticoid receptor is arguably one of the most prominent drug targets in human health being the central therapeutic target for a wide range of anti-inflammatory therapies [83] and its actions are central to wide spectrum of human phenotypes and diseases. Similarly, approximately 40 ChIP-seq datasets are currently available for this receptor. Similar analyses can be undertaken for other nuclear hormone receptors including peroxisome proliferator-activated receptors (PPARs) and the retinoic acid receptors (RARs) and can be extended to other clinically important transcription factors. Therefore the approach described in the current study has a significant potential to resonate with investigators who work with these other receptors.

We propose that these ChIP-seq datasets are very attractive to build consensus cistromes in different cell states for a given transcription factor. Specifically, the intersection of each of the different cistrome data sets established in different cell backgrounds can be generated to increase the statistical accuracy of the binding sites. Integration of these consensus cistromes with the NHGRI GWAS catalog has the strong potential to highlight how these receptors relate to human health and disease and specifically, what critical pathways are modulated by genetic variation. In this regard the current study is an important proof of principle that can be relatively quickly undertaken to highlight new avenues of biological function that can be exploited in diagnostic or therapeutic settings.

Finally, the current study has strong potential to resonate with biologists who study in depth either single members or related members from transcription factor superfamilies. These researchers bring considerable insight to research questions around a given transcription factor, but perhaps are less comfortable considering the intersection of multiple genome-wide data sets as a functional genomics approach. The current study therefore has the potential to serve as a guide for data integration and functional genomic discovery.

Applying this approach allowed us to be able to overlay a consensus VDR cistrome with data from the NHGRI GWAS catalog of SNPs, and identify disease and phenotype-associated variants that were genome-wide significant in the GWAS catalog and present at genomic VDR binding loci. Of these relationships, the most pronounced and significant related phenotype to emerge was associated with immune functions and capacity. What is also important is what did not emerge. Namely genetic variants associated with cancer risk were not commonly identified.

Within this framework, our study has demonstrated a number of intriguing findings. Using genome wide analyses of germline genetic variation and ChIP-seq data we identified the VDR binding loci significantly enriched for 42 disease- or phenotype-associated SNPs (significant GWAS findings and/or LD SNPs). We selected SNPs that were in LD, rather than in a certain genomic window (e.g., 10 kb from the binding site), as we wished to leverage understanding of genomic structure to inform how genetic variation impacted VDR function. Of these 42 SNPs, only two were associated with canonical DR3-type VDR binding sequences, with the majority of SNPs associated with other transcription factors, some of which are known to interact with the VDR. The analyses of shared enrichment of transcription factors by RegulomeDB and Haploreg also suggested that VDR interacts, perhaps in shared multimeric complexes with other transcription factors and potentially warrant further exploration given the shared platform of identification. The finding that the significantly enriched disease- or phenotype-associated SNPs are not commonly found in canonical DR3 VDR motifs is intriguing. All the published VDR ChIP-Seq identified significant enrichment of the DR3 motif but it was at best found in approximately 30% of VDR peaks [23]. However, in the current study, the number of sites with disease- or phenotype-associated SNPs did not appear as high as this, and arguably was even less common. These findings collectively suggest that the VDR may interact with the genome in a number of direct and indirect mechanisms. The indirect, trans, mechanisms remain enigmatic but the current findings suggest a functional interaction between VDR and NF-κB.

We have confidence in the strength of the findings. Analysis of the 42 disease- or phenotype-associated SNPs using ENCODE ChIP-Seq data demonstrated that 15 of these SNPS are “likely to affect transcription factor binding,” designated by a Regulome score ≤2. Therefore 33.3% of our final SNP set is likely to affect transcription factor binding. Across all GWAS and LD SNPs, ENCODE characterizes ~17% as likely to affect transcription factor binding. Therefore, a simple one-tailed Fisher’s exact test supports that indeed we can reject the idea that our SNP set is like that of the GWAS and LD background, meaning our SNPs show evidence of enrichment for affecting TF binding. Furthermore (as we have demonstrated) these functionally relevant SNPs are more frequently found when VDR appears to co-bind with another transcription factor explored, for example NF-κB.

Naturally, there are a number of caveats to these approaches and findings. For example, not all tissues express all the transcription factors highlighted by RegulomeDB and therefore interactions will be tissue specific. Similarly, another caveat is that DR3 binding elements may be under-represented in computational analyses due to technical reasons and incomplete understanding. However, it is worth stressing that each group that undertook the primary VDR ChIP-seq analyses reported that VDR regions containing DR3 elements represented a minority of VDR loci.

In a similar manner the significant enrichment SNPs associated with immune traits under VDR peaks in part reflects the role of VDR in immune function [84]. However these findings may reflect that VDR ChIP-seq were included from two lymphocyte CEPH cell lines. These cells contributed more VDR ChIP-peak enriched SNPs (442 SNPs) than either THP1 (43 SNPs), LS180 (80 SNPs) or LX2 (36 SNPs). Therefore we cannot exclude the possibility that this underpins the enrichment of immune phenotypes. Of course to counter this is the biological possibility that the VDR is intrinsically more genomically engaged in the GM cell lines, and hence the largest number of VDR peaks is identified in these cell models. Furthermore the GWAS catalog contains many more SNPs associated with non-immune phenotypes, for example cancer phenotypes, and these were not readily identified.

An intriguing new hypothesis generated by the current study is that genetic variants at the sites of VDR and NF-κB binding distort the control the of Toll-like receptor 1 (TLR1) and Long intergenic non- protein coding RNA 299 (LINC00299) to govern immunophenotypes. For example the genetic variants may change the on-off rates of the VDR- NF-κB complex binding at the site such that it alters the potential to regulate the downstream gene target. Identification of the sites of interaction in the current study actually make testing the significance of the sites through genome-editing approaches relatively easy for future studies (Figs. 2 and 3). In this manner it may be intriguing to edit the genome such that variants are added in and then the avidity of VDR and NF-κB can be tested with ChIP and Re-ChIP approaches as required. Such experiments would move towards understanding how genetic variation differentially impacts cis and trans binding events that in turn govern gene expression. Genome editing approaches may also be attractive for future studies as the SNPs in Table 2 are not highly penetrant, that is they are common variants with low/modest odds ratios, and are in syndromes that tend to be chronic and even adult onset. We believe that this reflects that VDR binding is part of a complex picture of transcription factor interactions that relate to disease. Thus looking for VDR binding at alleles associated with low/modest risk has a high probably of not demonstrating allele-VDR binding associations, not because the in silico experiments were incorrect but rather because these are complex disease phenotypes with several factors at play. Again, although the literature supports cooperation between these VDR and NF-κB signaling pathways, the current approach provides statistical significance to this interaction, and justifies the significant approaches required to test this in relevant biological settings. Precise genome editing approaches will allow the definitive testing of these possibilities.

The potential interactions between VDR and NF-κB are given further relevance as the principal disease phenotypes associated with the SNPs are enriched for immune-related phenotypes (Table 1). For example, leprosy and vitiligo have different pathophysiology but both have major immune components. Vitiligo is an autoimmune disease [85, 86], whereas defects in cell-mediated immunity have been reported in leprosy [87, 88]. Additionally, studies have shown the interaction of Toll-like receptors (TLRs) and the vitamin D anti-microbial pathway contribute to leprosy outcomes [88]. Given the large number of individuals around the world who are currently enrolled in vitamin D intervention and supplementation trials [41, 42, 8996] the current findings add to the genomic framework for interpreting the health responses, and for investigating the genetic basis for sensitivity and resistance to vitamin D exposure and VDR function. A further interesting finding of the current study is to some extent in the negative findings. Despite the large body of literature supporting roles for the VDR in various cancers (reviewed in [24]), only one phenotype associated SNP related to cancer and this suggests that genetic variation does not impact VDR signaling in cancer phenotypes to a significant extent.

Conclusion

Taken together the current study has examined VDR transcriptome behavior through integrative genomic approaches and demonstrated a number of intriguing and statistically robust findings. Firstly, that the VDR binding peaks were significantly enriched for 42 diseases or trait associated SNPs, identified through GWAS studies. Secondly that only two of these SNPs were associated with canonical DR-3 type VDR binding elements and the majority of SNPs were associated with other transcription factors some of which are known to interact with the VDR, whereas others are novel. Finally the disease phenotypes are overwhelmingly enriched for immune-related phenotypes. Given the large number of individuals around the world who are currently enrolled in vitamin D intervention and supplementation trials, the current findings provide a framework for interpreting the health responses and for investigating genetic the basis for sensitivity and resistance to VDR function.

Methods

Study design

The study design is summarized in Fig. 1 and Additional file 1: Table S2. Briefly, we merged peaks identified in multiple VDR ChIP-seq data sets to build a consensus VDR cistrome which was then overlapped with both SNPs identified by GWAS (we call the SNPs that are reported to be statistically associated with a phenotype lead SNPs) and SNPs in LD with lead SNPs. We then identified lead plus LD SNPs under VDR peaks and Hypergeometric testing was used to identify phenotypes enriched for presence under VDR peaks. SNPs were further annotated based on presence of absence of a VDR motif e.g., a motif was identified as present under peak or not. Statistically significant lead and LD SNPs were further annotated for function and interaction with other transcription factors.

Statistical and bioinformatics analyses

VDR ChIP-seq data sets and peak identification

As the individual ChIP-Seq data sets had been analyzed using different workflows, we chose to re-align the reads, and define enriched peaks with the same harmonized workflow. In order to allow direct comparison across all the published VDR ChIP-seq data, each VDR ChIP-seq data set was analyzed again [68]. In summary, VDR ChIP-seq data from five studies (six data sets) were selected for analysis, including GM10855 and GM10861 lymphoblastoid cells [9], THP-1 monocyte-like cells [8], LS180 colorectal cancer cells [11], LX2 hepatic stellate cells [10] and LPS-differentiated THP-1 cells [68]. Sequence reads were aligned hg19 Bowtie software version 1.0.0 [97] and significant peaks were identified using MACS version 2 [98] with the following essential command line arguments: macs2 callpeak –bw 150 –keep-dup 1 –g hs —qvalue = 0.01 –m 5 50 —bdg. Otherwise, default parameters were used. From these the analyses the combined data set was generated which contained in total 23,409 non-overlapping VDR peaks across these samples and analyzed using HOMER to identify DR3-type sequences [68]. Briefly, each data set was searched for de novo motifs [99] on ±100 bp peak summit regions using the following essential command line arguments: annotatePeaks.pl < sample_name > hg19 –noann –nogene –m < motif_file > −size −100, 100. Detailed methods for this analysis are provided in our previous publication [68].

Selection of lead SNPs and linkage disequilibrium SNPs

To create a comprehensive SNP list associated with disease/phenotypes at genome-wide significance level we downloaded the NHGRI GWAS Catalog (dated: 12/15/2015) [100]. We used SNP Annotation and Proxy Search (SNAP) tool to identify SNPs in strong linkage disequilibrium (LD) with these GWA SNPs we used HapMap and 1000 genome data from the Centre d’Etude du Polymorphisme Humain (CEU) population [101]. LD (r2) and minor allele frequency (MAF) thresholds for SNP selection were 0.8 and 0.05, respectively, considering all SNPs within a 500 kb region surrounding the Lead SNP. This yielded a final list of significant SNPs from replicated GWAS plus LD SNPs (SN).

Identification of statistically significant SNPs in VDR binding regions for each GWA phenotype

We used a hypergeometric test to assess an over or under representation of GWAS and LD SNPs under peaks for each trait/phenotype and to determine if SNPs associated with a particular phenotype were more frequently under a VDR Chip-seq peak than expected. All analyses were done in R using GenomicRanges [95]. To test for evidence of statistically significant enrichment of SNPs associated with a particular trait/phenotype under VDR peak we used a hypergeometric test using phyper function in R and applied bonferroni correction for the number of phenotypes tested. In addition to SN and SP for each trait we identified the total number of both GWAS and LD SNPs (Sn) and then how many of those SNPs overlap with VDR peaks (Spi for all i = 1 to 211, where i reflects each phenotype of the 211 phenotypes tested). This information was then used to perform the hypergeometric test using phyper function in R for each trait. A Bonferroni correction was used to control for the number of phenotypes tested.

Functional annotation of significant SNPs

To understand the function consequences of the SNPs showing evidence of enrichment under VDR peaks we used two different in silico methods with a specific focus on chromatin status around the SNPs and how the SNPs affect binding motifs of different transcription factors. First, we utilized RegulomeDB [20] to identify important chromatin signatures and transcription factors binding to the SNPs. RegulomeDB utilized ChIP-seq data for histone marks and transcription factors across multiple cell lines from ENCODE [1, 102]. Based on these data sets RegulomeDB scores each SNPs for their function importance. SNPs with scores of 1a-f, 2a, 2b and 2c are thought to likely affect transcription factor binding. In addition we used HaploReg v2 to determine if the presence of each SNP affects the binding motif of a transcription factors based on position weight matrix [71]. HaploReg constructs position weight matrix for both reference and altered allele for each SNP from multiple data sources and provides a log-odds (LOD) score. The change in log-odds (LOD) score as alleles change (LOD(altered allele) – LOD (reference allele)) was determined and as was the number of affected motifs. A negative LOD difference value indicates the predicted relative affinity is higher for the reference sequence whereas a positive value means that the predicted relative affinity is higher for the alternate allele.

Intersection of VDR and NF-κB binding sites in CEPH cell lines

We focused on the potential of VDR binding site-associated SNPs to impact NF-κB binding by examining their co-incident binding. To test the possibility of significant shared genomic binding VDR and NF-κB we established a consensus cistrome for each factor and then intersected these data sets. Specifically, basal and 1α,25(OH)2D3-stimulated VDR cistromes was generated by intersecting the ChIP-Seq from GM10855 and GM10861 cells. TNFα-stimulated NF-κB cistrome was generated by intersecting the ChIP-Seq from the cell lines GM12878, GM12891, GM12892, GM15510, GM18505, GM18526, GM18951, GM19099, GM19193. Overlaps of the different datasets were examined using GenomicRanges in R and were deemed positive if at least 25% of the peak genomic region overlapped.