A multi-step genomic approach prioritized TBKBP1 gene as relevant for multiple sclerosis susceptibility

Background Over 200 genetic loci have been associated with multiple sclerosis (MS) explaining ~ 50% of its heritability, suggesting that additional mechanisms may account for the “missing heritability” phenomenon. Objective To analyze a large cohort of Italian individuals to identify markers associated with MS with potential functional impact in the disease. Methods We studied 2571 MS and 3234 healthy controls (HC) of continental Italian origin. Discovery phase included a genome wide association study (1727 MS, 2258 HC), with SNPs selected according to their association in the Italian cohort only or in a meta-analysis of signals with a cohort of European ancestry (4088 MS, 7144 HC). Top associated loci were then tested in two Italian cohorts through array-based genotyping (903 MS, 884 HC) and pool-based target sequencing (588 MS, 408 HC). Finally, functional prioritization through conditional eQTL and mQTL has been performed. Results Top associated signals overlap with already known MS loci on chromosomes 3 and 17. Three SNPs (rs4267364, rs8070463, rs67919208), all involved in the regulation of TBKBP1, were prioritized to be functionally relevant. Conclusions No evidence of novel signal of association with MS specific for the Italian continental population has been found; nevertheless, two MS loci seems to play a relevant role, raising the interest to further investigations for TBKBP1 gene. Supplementary Information The online version contains supplementary material available at 10.1007/s00415-022-11109-8.


Introduction
Multiple Sclerosis (MS) is a complex autoimmune disease caused by the interplay of multiple genetic and environmental factors [1]. Large genome wide association studies (GWAS) provided important clues for the identification of genetic causes with more than 233 loci being discovered [2][3][4]. Although they are robustly associated, the identification of causative variants and the functional mechanism influencing MS pathogenesis has been identified only in few cases [5,6]. Moreover, these studies rely on the inclusion of different populations to reach very large cohorts necessary to detect small effects like those observed for MS; however, different frequencies of disease variants across the different populations may mask the existence of additional associations. This was clearly evidenced in a GWAS in Sardinians, a genetically isolated population, which led to the identification of a novel MS variant in the TNFSF13B gene, whose allelic frequency allows its identification only in Sardinians and Southern European populations [6].
In the present project, we aim to analyse a large cohort of more than 5500 Italian individuals to identify markers associated with MS with potential functional impact in the disease. To boost the power of the analysis, we metaanalysed association results from different cohorts and leveraged quantitative molecular traits on adequately large resources for functional SNPs prioritization.

Subjects
A total of 2571 Italian MS patients and 3234 unrelated HC have been recruited across Italian centres by three consortia (PROGRESSO, PROGEMUS and HYPERGENES). All participants were of Caucasian ancestry.
Based on the genotyping platform used for the genotyping, this entire set of samples was divided in four cohorts: a) ITA GWAS : it was composed of 776 MS and 1315 HC.
MS samples were previously genotyped in the context of the International Multiple Sclerosis Consortium (IMSGC) at genome-wide level at the Wellcome Trust Sanger Institute using the Illumina ® Human660-Quad chip [2]. Healthy individuals (mean (± SD) age of 53.4 ± 9.9 years and no abnormal findings on physical and neurological examination) have been collected within the HYPERGENES project (European Network for Genetic-Epidemiological Studies; www. hyper genes. eu) [7] in Italy and genotyped at the University of Milan, using the Illumina ® Infinium 1 M-duo BeadChips. b) ITA iChip : it was composed of 964 MS patients and 1054 matched controls non-overlapping with the ITA GWAS cohort. Samples were previously genotyped in the context of the IMSGC at the Wellcome Trust Sanger Institute using the Illumina ® Immunochip covering 196,524 SNPs and 184 non-HLA genomic regions with potential immunological function [3]. c) ITA OA GWAS and ITA iChip cohorts based on their genetic risk burden [8], estimated as a weighted sum of risk alleles in non-HLA known MS-associated loci [9][10][11]; 600 MS patients with a higher genetic risk (based on the distribution of the entire Italian cohort) were selected together with age-and sex-matched 408 HC, according to sex and Principal Components (PCs) to account for population stratification. ITA NGS cohort underwent pooled-sequencing as previously described [12] using the Agilent SureSelect target enrichment method (Agilent Technologies, Santa Clara, USA) according to the manufacturer's protocol.
The four cohorts are partially overlapping, as shown in Online Resource 1 fig. S1.
Demographic information for the mentioned cohorts is listed in Table 1, while details on genetic data generation are described in the Online Resource 1.
Additionally to the Italian cohorts, a total of 4088 MS patients and 7,144 HC of European descent ("NonIT-EUR" cohort) have been included in the analyses for the "metaanalysis approach" (see below). This represents a combination of 6 individual sample collections previously described [13].
All patients were diagnosed according to Poser and McDonald criteria and further revisions [14]. Figure 1 shows the workflow of the study. Briefly, we followed a multi-step approach including a discovery step at genome-wide level followed by replication and fine-mapping through next-generation sequencing of the top-associated loci. Finally, the effect of the identified variants on gene expression and DNA methylation was evaluated to prioritize markers with potential functional role.

Discovery
We initially investigated the presence of association signals specific for the Italian population by performing a GWAS in the ITA GWAS cohort alone (here called the "genomewide Italian approach"), this led to the selection of SNPs (P < 5 × 10 -7 ) to further replicate in the replication step.
An alternative approach (here called the "meta-analysis approach") was designed to increase the sample size of the discovery phase and detect association signals with a lower impact on MS susceptibility not detectable when studying only the ITA GWAS cohort. Specifically, two additional cohorts were included in the study design: the ITA iChip , composed by Italian individuals genotyped at relevant immunological genomic regions, and the NonIT-EUR dataset, which represents a large cohort of non-Italian individuals for which genome-wide genotyping data are available. According to this approach, we selected SNPs with suggestive evidence of association (P < 0.005) in the ITA GWAS cohort alone (to detect signals at genome-wide level), as well as performing a meta-analysis between ITA GWAS and ITA iChip cohorts. A meta-analysis with the NonIT-EUR dataset was then performed for the selected signals and significant SNPs (P < 5 × 10 -7 ) were considered for replication.

Replication
We defined loci for replication as described in the Online Resource 1. Due to the known strong associations within the MHC region with MS [4], SNPs located in the HLA locus were excluded.
Genotyping was performed using TaqMan assays and the TaqMan Open Array Genotyping System (Thermo Fisher Scientific, Waltham, USA) or the MS Chip [4] (Illumina, San Diego, USA). Single-SNP logistic regression using sex as covariate was conducted under additive model of inheritance. Meta-analyses between the discovery and replication cohorts were performed assuming a fixed-effect model.

Fine mapping with next generation sequencing (NGS)
We sequenced the Linkage Disequilibrium (LD) blocks containing the replicated associated signals in the ITA NGS cohort using the Agilent SureSelect kit (Agilent Technologies, Santa Clara, USA) and a pool-based approach on the GaIIx platform (Illumina, San Diego, USA). Bioinformatics pipeline and variant calling were performed as described [12]. Fisher exact test comparing MS and HC was performed on allelic counts, no correction based on sex or ancestry was included since HC were individually matched with MS patients, as described in the "Subjects" paragraph. For methodological details see Online Resource 1.

Prioritization
Identified associated variants were prioritized according to their cis effect on expression or CpG methylation by performing conditional cis-eQTL and conditional cis-mQTL analyses. Cis-eQTL analysis was performed taking advantage of the GTEx Portal v6p [15] dataset (release phs000424. v6.p1) for 7 MS-related tissues (whole blood, Brain Fron-tal_Cortex (BA9), Brain Cortex, Cerebellar Hemisphere, Cerebellum, Anterior cingulate cortex (BA24), Hypothalamus). Cis-mQTL analysis was performed on CD4 + T-cells collected from 156 MS patients within the Comprehensive Longitudinal Investigation of Multiple Sclerosis at the Brigham and Women's Hospital (CLIMB study [16,17]). Both analyses were performed using QTLtools [18]. For details see Online Resource 1.

Discovery
As described in the method section, the "genome-wide Italian approach" includes the ITA GWAS cohort as discovery dataset. After quality control (QC), the association analysis was performed in a final dataset of 737 MS and 1291 HC individuals and ~ 6.6 million SNPs. Seventy-one variants reached a p value of association < 5 × 10 -7 , within 26 distinct loci (Online Resource 2, Online Resource 1 Fig. S2).
The "meta-analysis approach" includes both the ITA GWAS and ITA iChip cohorts as discovery datasets, with signals with suggestive evidence of association (P < 0.005) selected to be meta-analysed with the NonIT-EUR. Regarding the ITA GWAS cohort alone, 32,309 SNPs (P < 0.005) were meta-analysed with the NonIT-EUR leading to a list of 49 SNPs with a p-value of association < 5 × 10 -7 (Online Resource 3A). The meta-analysis of the two Italian discovery datasets led to the identification of 1373 SNPs suggestive of association (P < 0.005). Meta-analysis of these SNPs with the NonIT-EUR dataset resulted in 21 SNPs with a p value < 5 × 10 -7 (Online Resource 3B) within 5 loci. Overall, and given the fact that some signals overlapped between the different analyses, 127 signals across 35 loci were selected (Online Resource 2 and Online Resource 3).

Fig. 1
Workflow of the study. The overall workflow of the study is shown. The discovery phase includes two Italian cohorts (ITA GWAS and ITA iChip ) from which a set of top MS associated SNPs were identified through a genome-wide approach (in the ITA GWAS cohort only) or a meta-analytic approach including also the NonIT-EUR dataset.
The identified top SNPs (P < 5 × 10 -7 ) were further tested for replication in a third independent Italian cohort (ITA OA ) leading to the confirmation of two associated loci. Those loci were sequenced (ITA NGS cohort) and the associated SNPs underwent prioritization through conditional cis eQTL and mQTL analyses

Replication
Starting from the 127 signals identified through the discovery analyses (Online Resource 2 and Online Resource 3), we selected 60 SNPs for replication as representative of the identified genomic loci (see Online Resource 1 for details and Online Resource 2 and Online Resource 3 for the final list). After genotyping and QC (see Online Resource 1 for details), 48 SNPs within 30 loci were tested for association in the ITA OA dataset ( Table 2).

Fine mapping
The replicated locus on chromosome 3 spans from position 28024996 to 28124996 (hg19 assembly). This is an intergenic region including the non-coding genes LINC01980 (~ 123 kb) and LINC01981 (~ 150 kb) and the coding genes EOMES (~ 160 Kb) and CMC1 (~ 150 kb) as the closest downstream genes. A total of 608 variants were identified through the sequencing approach. Comparing to the post-QC imputed ITA GWAS dataset, the inclusion of a sequencing step allowed to investigate additional 74 common variants (MAF ≥ 0.01), as well as 636 rare variants (Minor allele Frequency (MAF) < 0.01), while 109 were excluded and not tested in the ITA NGS cohort due to the stringent QC applied. Fisher's exact test on sequenced data revealed the association of 12 variants (Bonferroni-adjustment alpha = 5.87 × 10 -5 ). The most associated SNPs were rs669607 and rs427221 (P = 5.02 × 10 − 13 ) (Fig. 2a, Online Resource 4).
The replicated locus on chromosome 17 spans a large LD block of ~ 500 kb (chr17:45309693-45824486, hg19 assembly) which includes 8 genes: ITGB3, EFCAB13 (C17orf57), NPEPPS, KPNB1, TBKBP1, TBX21, MRPL45P2 (pseudogene) and LOC102724508 (a non-coding RNA, with alias THCAT158). In this region, a total of 2,428 variants were identified through the sequencing approach. Comparing to the post-QC imputed ITA GWAS dataset, the inclusion of the fine mapping step allowed to investigate additional 446 common variants and 2,395 rare variants, while 341 SNPs were not tested in the ITA NGS cohort for the sequencing QC. Applying a pipeline which allows an accurate allele frequency estimation from NGS pooled data [12] to perform the association analysis, we found that 202 SNPs were significantly associated after Bonferroni-adjustment (alpha = 1.57 × 10 − 5 ) (Fig. 2b, Online  Resource 4). The strongest association signal was rs4239162 (P = 1.04 × 10 − 20 ), intronic to KPNB1.

Locus on chromosome 3
The locus on chromosome 3 is a small intergenic genomic region, with apparently no evidence of chromatin annotation (Online Resource 1 Fig. S4), except for the presence of sporadic weak enhancers. Cis eQTL analysis did not reveal any association between the 12 SNPs identified in the genetic analyses and the expression of nearby genes.

Locus on chromosome 17: functional SNPs prioritization
We sought to characterize the identified MS risk variants prioritizing those with a putative functional impact. We then performed a conditional cis-eQTL analysis in 7 tissues relevant for MS according to data generated within the GTEx project [15]. As shown in Online Resource 5A, we found 7 SNP-gene pairs associated in the tested tissues. Among them, rs4267364 showed a consistent association with MS in all the steps of the study (Table 3). This SNP is intronic to TBKBP1 and the MS associated allele (rs4267364 A ) is associated to a higher expression of this gene in blood (Fig. 3a). According to the chromatin state prediction in 14 epigenomes of immune-related cells profiled by the Roadmap Epigenomics [19] and ENCODE [20] projects, this SNP is localized in a putative enhancer, supporting its regulatory function (Online Resource 1 Fig. S5).
To explore additional mechanisms of regulation, we performed a conditional cis-mQTL in CD4 + cells collected from MS patients. The Online Resource 5B lists the significant SNP-CpG pairs resulted from this analysis. Among the identified mQTL associations, one SNP associated to MS according to the discovery strategy and at least one of the replication steps, rs8070463, was found to be associated also with the methylation status of cg08452456 (Table 3, Fig. 3b), exonic to TBKBP1. An additional SNP, rs67919208, associated with the methylation status of cg12183861 (Table 3, Fig. 3c), localized in the 5'UTR region of TBKBP1, was also found to be associated with MS (discovery step). Both rs8070463 and rs67919208 had the MS risk variant associated with a higher CpG methylation (Table 3).
All the three prioritized variants suggest an involvement of TBKBP1 gene, thus we then evaluated the expression of   TBKBP1 across immune cells according to the DICE dataset [21]. As shown in Online Resource 1 fig. S6, TBKBP1 is expressed across several immune cell types, with the highest levels in activated T cells (CD4 + and CD8 + ) and Natural Killer cells.

Discussion
Large scale international studies showed a highly polygenic architecture for MS genetic susceptibility [4]; nonetheless, the identified variants explain ~ 50% of the genetic predisposition to MS, suggesting that additional signals and disease mechanisms exist. This "missing heritability" could be explained, at least partially, by the presence of populationspecific associations that cannot be fully captured by arraybased international studies, as already shown studying the Sardinian [6] and German populations [22]. To investigate the presence of strong association signals enriched in the Italian population, we initially performed a genome-wide association study (referred along the paper as the "genome-wide Italian approach") in a cohort of Italian individuals that led us to the identification of several loci potentially associated with MS, most of them never reported in previous studies (Online Resource 2). However, replication did not confirm any of these associations, supporting the idea that no strong signals exist in the continental Italian population. Nevertheless, we can't exclude that other events occurred, including population stratification, genetic heterogeneity, or an insufficient statistical power [23].
We then followed an alternative approach, exploring also signals with suggestive evidence of association and meta-analysing them with a cohort of European descent with larger adequate sample size and statistical power. The genome-wide Italian approach would have the potential to identify signals with a strong association with MS specific for the Italian population, while the second approach combined the advantage of using an ancestrally homogeneous cohort as the driving cohort of loci selection with the statistical power gained using a larger cohort. Specifically, this second approach, referred as "meta-analysis approach", highlighted the presence of two loci representing the main results of this effort. Both lie in loci already known to be associated with MS [2][3][4]: an intergenic genomic region on chromosome 3 (~ 100 Kb) and a large locus of ~ 500 kb on chromosome 17.
To better characterize the two selected loci and to prioritize putative causal variants, we then performed a deep target sequencing followed by functional prioritization, allowing to investigate additional associated SNPs, not covered by arrays or imputation. In the case of the locus on chromosome 3, this analysis led to the identification of 12 associated SNPs, with rs669607 and rs427221 as the top associated ones [2]. No clear function is suggested, since this is an intergenic region with no annotation suggestive of specific regulatory function [4]. Nevertheless, very recently, a regulatory effect on the upstream gene EOMES in T cells For each SNP, the association data in the replication cohort and results from the meta-analyses between discovery and replication cohorts (including and not including the NonIT-EUR cohort) are reported. Results for the genome-wide Italian approach are reported in the upper panel (A), while results for the approaches including the NonIT-EUR are reported in panel B (discovery sample derived from ITA GWAS alone) and C (SNPs identified through the meta-analysis between ITA GWAS and ITA iChip ) A1: reference allele (the same as defined in Table 1); OR : odds ratio; L95: lower-bound of 95%-confidence interval; U95 upper bound of 95%-confidence interval; P: P value of association; I 2 heterogeneity index.
The SNPs with an association beyond the multiple testing correction (Bonferroni threshold of P value: 1.7 × 10 -3 , 3.6 × 10 -3 , 6.25 × 10 -3 respectively for A, B and C) are highlighted in bold. No significant association was found in the replication cohort for SNPs selected through the Italian genome-wide approach   [25]. On the contrary, the locus on chromosome 17 is more complex, with several genes and potentially regulatory regions. Sequencing of this region led to the identification of 202 signals of associations with the top associated signals localized in the downstream part of the region, as suggested also by the discovery analysis. This high number of associated signals could be explained by the strong linkage disequilibrium characterizing this locus (Fig. 2b), which further complicates the identification of the causative signal of association. We therefore resorted to exploit regulatory data for prioritization of the identified SNPs. Several studies reported an enrichment of regulatory regions within MS and other complex diseases GWAS loci [24, [26][27][28][29][30], thus we performed a conditional cis eQTL analysis to either overcome the presence of the large LD block and to combine the results with the obtained genetic association. This analysis highlighted rs4267364, which is an intronic polymorphism to TBKBP1, localized in a putative enhancer. This SNP is associated with the expression of TBKBP1 in blood, but not in brain-related tissues, with the MS risk allele rs4267364 A associated with a higher expression of TBKBP1 (P = 1.71 × 10 -23 ). We also investigated additional mechanisms of regulation performing a conditional cis mQTL analysis to identify SNPs associated with the methylation  Table 3 MS associated signals showing independent eQTL and mQTL associations Signals associated with MS according to at least two of the steps performed (discovery: meta-analysis between ITA GWAS and NonIT-EUR P < 5 × 10 -7 ; replication: P < 3.6 × 10 -3 , target sequencing: P < 1.57 × 10 -5 ) that show also independent eQTL (A) or mQTL (B) association are shown. For the SNP with eQTL effect, the SNP-associated gene pair is indicated, together, in the eQTL panel, with the backward nominal P value of association and the backward regression slope referred to A1 in whole blood. For the SNPs with an mQTL association, the SNP-CpG pairs are indicated together, in the mQTL panel, with the backward nominal P value of association and the backward regression slope referred to A1. , with two SNPs, rs8070463 and rs67919208, which were also selected by the genetic approach. These two SNPs were found to be associated with the methylation status of respectively cg08452456, exonic to TBKBP1, and cg12183861, localized in the 5'UTR of TBKBP1. Interestingly enough, rs8070463 SNP has been previously associated with ankylosing spondylitis, a rare autoimmune condition [31]. It is well established that methylation of regulatory regions is an important mechanism [32]: DNA methylation in promoter regions is known to induce gene silencing [33], while it has been reported that DNA methylation in transcribed regions seems to correlate with an induced gene expression [34]. Given their genomic position, we could hypothesize that the influence on the methylation status of the mentioned CpGs could in turn lead to an upregulation of TBKBP1, as suggested by eQTL analysis on GTEx dataset, with the MS risk allele being associated with an upregulation of TBKBP1 in whole blood (P = 1.03 × 10 -20 and 2.17 × 10 -20 respectively for rs8070463 and rs67919208), although we could not exclude that this reflects a LD with the eQTL leading SNP rs4267364 (r 2 = 0.66 for both the SNPs). Functional experiments testing methylation and expression profiles in same samples are necessary to clarify the direction of the signal. All of the three prioritized SNPs seems to be involved in the regulation of the expression of TBKBP1, suggesting that an upregulation of this gene could be relevant in MS pathogenesis. TBKBP1 codes for the TBK1 binding protein 1, an adaptor protein able to bind TBK1 and involved in the TNFα/NFkB pathway [35]. Contrary to TBKBP1, for which few reports are available regarding its function, TBK1 is involved in several immune processes, including T-cell homeostasis and migration cells from lymph nodes to the CNS in the MS murine model [36]. Although the direct involvement of TBKBP1 in TBK1-mediated signaling in migration of T cells has never been investigated, we could not exclude that an alteration of the expression of this gene could also influence this mechanism. Moreover, TBKBP1 was recently found to be an important regulator of development and survival of natural killer (NK) cells [37]. The role of NK cells in MS is still debated, however there is increasing evidence that certain subtypes of NK cells are able to modulate other immune cells and could be involved in autoimmune processes [38,39]. Supporting these hypotheses, TBKBP1 is highly expressed in activated CD4 + and CD8 + T cells, as well as NK cells (Online Resource 1 Fig. S6). Further tailored experiments are needed to elucidate the involvement of TBKBP1 in MS.
Concluding, our results suggest that no novel signal associated with MS with a strong effect and specific for the continental Italian population exists. Nevertheless, we prioritized two loci, already identified to be associated with MS, that may have ethnic specificity in the Italian population, one of them pointing to the TBKBP1 gene as a candidate to be investigated further.