Background

Rheumatoid arthritis (RA) is a chronic systemic autoimmune disease characterized by damage to synovial joints as well as extraarticular manifestations. The strongest known genetic risk factor is the HLA-DRB1 gene on chromosome 6, namely a set of alleles sharing a common sequence known as the shared epitope (SE) [1]. Recent whole-genome association studies have revealed new risk genes outside of the HLA region [24], and some studies have also provided evidence of additional influences from the HLA class III and class I regions [5, 6]. In this analysis we sought first to identify and/or validate RA risk alleles throughout the genome, and then to identify independent associations with RA susceptibility in the major histocompatibility complex (MHC) in addition to the SE.

Methods

Illumina 550k genotyping data from a whole-genome association study by North American Rheumatoid Arthritis Consortium (NARAC) [3] was used for this study as part of the Genetic Analysis Workshop 16; duplicated and contaminated samples had been removed previously. Using the computer program PLINK [7], subjects were filtered who had less than 90% genotyping, and single-nucleotide polymorphisms (SNPs) were filtered that had less than 90% genotyping, Hardy-Weinberg equilibrium in controls p < 0.0001, or minor allele frequency (MAF) < 0.05. Using the computer program EIGENSTRAT [8], population outliers were removed who were >6 standard deviations from any of the first five principal components (PCs) identified in PC analysis.

First we analyzed the whole-genome data using structured association analyses of EIGENSTRAT; although the NARAC cases and controls are Caucasian, differences in intra-European ancestry [9] can produce false-positive associations. SNPs used for the PCA were filtered to remove regions of extended local associations (chr 8: 8-12 Mb, chr 6: 24-36 Mb, chr 11: 42-58 Mb, chr 5: 44-51.5 Mb, and chr 17: 40-43 Mb) and pruned to have r2 < 0.2 within a sliding window of 1 kb with a step size of 100, similar to methods of Fellay et al. [10] and Hom et al. [11]. We included correction for the first six PCs (see Results).

We used a conservative genome-wide significance threshold of p = 1 × 10-8. Because all SNPs exceeding this threshold (i.e., lower p-value) were in the MHC in a region of extended linkage disequilibrium (LD), we proceeded with conditional analyses to attempt to establish signals that were independent of the shared epitope and each other. We modeled the SE as a multi-allelic marker with values corresponding to negative, low-risk, or high-risk. SE alleles were considered high risk if they were one of DRB1*0401, 0404, 0405, 0408, or 0409. Table 1 shows the case-control ratios for each risk category.

Table 1 HLA-DRB1 risk levels. Definitions and case-control ratios for shared-epitope (SE) categories.

Our conditional analyses, using the computer program Whap [12], proceeded as follows. Starting with the SE as the top marker, we tested each two-SNP marker (SE plus each other SNP) for independence of the other SNP; in particular this uses a likelihood-ratio test to determine the significance of the difference between the two-SNP "alternate" model versus the one-SNP (SE in this case) "null" model. As long as the most significant SNP was <0.001, we added this best SNP to the list of independent SNPs, and proceeded to test all three-marker combinations compared to our best two-SNP model; and so on with larger haplotypes. In the final list, we also tested each locus for a significant addition to the model containing all other top SNPs.

Results

After quality control filtering (above), 486,078 SNPs remained for the whole-genome analyses. All subjects had genotyping >90%, and eight controls were removed as outliers detected by EIGENSTRAT, leaving 868 cases and 1186 controls for the final analyses. In structured association analysis with EIGENSTRAT, we corrected for the first six PCs because the scree graph of eigenvalues levelled off at the sixth component. PCs one, two, four, and five were all highly significant in association tests with cases and controls (all p ≤ 10-8). In this whole-genome analysis we identified exactly 200 SNPs with p < 10-8, all between 30.38 Mb and 33.08 Mb in the MHC region. Table 2 shows the significance of associations in this dataset for known RA risk alleles outside of the MHC [2, 3, 1317]; for SNPs not in the Illumina 550k panel, there were perfect proxy SNPs (r2 = 1) in the HapMap CEU population [18]. Although not reaching our 10-8 threshold, we observed p-values from 10-5 to 5 × 10-6 for PTPN22 and TRAF1-C5 SNPs, and p = 0.03 for STAT4. This dataset was underpowered to detect any of these risk alleles at a genome-wide level for their published odds ratios (ORs); for example, we had approximately 70% power to detect the highest OR of 1.75 (PTPN22, MAF = 11%) at p = 10-8, and only 50% power to detect the lowest OR of 1.15 (CD40, MAF = 25%) at p = 0.05.

Table 2 Significance of associations with published RA risk alleles

Results of our conditional analyses are shown in Table 3. Loci are shown in the order added by the algorithm (see Methods), i.e., rs261946 has the lowest p-value conditional on the SE, rs2074488 has the lowest p-value conditional on both SE and rs261946, and so on. Out of the 200 SNPs, three independent signals were evident in addition to the SE risk levels. One signal is located in the classical HLA class II region between genes BTNL2 and HLA-DRA, and two signals are in the classical HLA class I region (see Figure 1) near TRIM39 and HLA-C.

Figure 1
figure 1

Region with LD of SNPs studied in conditional analyses. Haploview diagram displays increasing red with higher D'. Three independently significant RA SNPs and HLA-DRB1 locations are indicated.

Table 3 Independent SNPs in conditional analysis of RA susceptibility

Table 4 shows the case-control frequencies and ORs for the final haplotypes, with the most common haplotype (ACG-1) as the reference haplotype. The highest-risk haplotype of these SNPs and the SE level had OR = 27.2 (95% CI, 16.7-44.4) in comparison to OR = 7.3 (95% CI, 4.7-11.1) overall comparing the SE risk levels alone (data not shown).

Table 4 Frequencies and odds ratios for haplotypes of three SNPs and shared epitope proxy risk level

Conclusion

In our analysis of the Genetic Analysis Workshop 16 dataset, there was insufficient power to detect known associations with RA susceptibility at a genome-wide significance level outside of the MHC; the most significant association was p = 5.3 × 10-6 for PTPN22. Clearly, the MHC is the most influential genetic region in RA susceptibility, but extensive LD makes isolating the precise loci difficult. We have used conditional analyses as a tool to investigate the presence of multiple RA risk factors in the MHC region in addition to the SE. Out of 200 candidate SNPs having unconditional p-values < 10-8, we have identified an additional HLA class II marker and two HLA class I markers which have significant associations with RA susceptibility that are not fully explained by LD with HLA-DRB1. A better understanding of these genetic influences can be helpful in elucidating the complex genetic components of RA.

Previous studies of MHC effects on RA susceptibility beyond the SE have identified additional independent signals but have been largely inconsistent, due at least in part to the difficulty of narrowing down regions of association in the presence of extended LD [1, 13]. Multiple studies have implicated the TNF-lymphotoxin locus in class III [1], which were not significant in our conditional analysis. Other studies also using NARAC cases have observed signals in class I [5, 14], including HLA-C, our second SNP added in conditional analysis. Our first SNP is in the gene TRIM39, also in class I but not previously implicated. Our third SNP, in class II, is 150 kb upstream from HLA-DRB1 between the BTNL2 and HLA-DRA genes. BTNL2 has been associated with RA, systemic lupus erythematosus, and type 1 diabetes [15]; this is attributed to its association with predisposing HLA DQB1-DRB1 haplotypes, which may explain its presence in our data as well.

It is important to note that the NARAC population is primarily Caucasian. Other populations could have quite different distributions of these haplotypes as well as other haplotypes and allele frequencies. A similar analysis in other ethnic groups could be very informative.