From genome-wide association studies to disease mechanisms: celiac disease as a model for autoimmune diseases
- Cite this article as:
- Kumar, V., Wijmenga, C. & Withoff, S. Semin Immunopathol (2012) 34: 567. doi:10.1007/s00281-012-0312-1
Celiac disease is characterized by a chronic inflammatory reaction in the intestine and is triggered by gluten, a constituent derived from grains which is present in the common daily diet in the Western world. Despite decades of research, the mechanisms behind celiac disease etiology are still not fully understood, although it is clear that both genetic and environmental factors are involved. To improve the understanding of the disease, the genetic component has been extensively studied by genome-wide association studies. These have uncovered a wealth of information that still needs further investigation to clarify its importance. In this review, we summarize and discuss the results of the genetic studies in celiac disease, focusing on the “non-HLA” genes. We also present novel approaches to identifying the causal variants in complex susceptibility loci and disease mechanisms.
Keywords: Celiac disease; Autoimmune disease; Immune-related disease; Genome-wide association studies; GWAS; Pathway analysis
Immune-related diseases range from autoimmune diseases such as celiac disease (CD), rheumatoid arthritis (RA), multiple sclerosis (MS), and type I diabetes (T1D), to more chronic inflammatory disorders such as asthma and inflammatory bowel disease (IBD). Together these disorders now account for 5–10 % of all disease cases in Western countries (see elsewhere in this issue).
Celiac disease is one of the best-understood immune-related diseases. It is the most common food intolerance in humans, affecting at least 1 % of the Western population. It is a multifactorial disease caused by many different genetic factors that act in concert with non-genetic causes. A genetic association between CD and the HLA class II genes in the major histocompatibility complex (MHC) was documented almost 40 years ago. One of the most important triggering factors is dietary gluten, a storage protein present in wheat and related grains (hordein in barley, secalin in rye, and avenin in oats) (see elsewhere in this issue).
CD is an excellent model for studying the contribution of genetic factors to immune-related disorders because: (1) the environmental triggering factor is known (gluten), (2) as in other autoimmune diseases, specific HLA types (HLA-DQA1 and HLA-DQB1 in the case of CD) are critically involved (see elsewhere in this issue), (3) there is involvement of non-HLA disease-susceptibility loci, many of which are shared with other autoimmune diseases, (4) there is an elevated incidence of other immune-related diseases both in family members and individuals, and (5) both the innate and the adaptive immune responses play a role in CD.
Prior to genome-wide association studies (GWAS), the genetics of CD was investigated through candidate gene studies in case-control cohorts and linkage studies in multi-generation families and affected sibpairs. None of these studies convincingly identified genetic factors beyond the well-established HLA-DQA1 and HLA-DQB1 genes. With the introduction of GWAS, the number of genetic factors implicated in CD has increased and 54 % of its heritability can now be explained. Although the methods for calculating heritability are currently under debate, CD remains the immune-related disorder with the best-characterized genetic component (by comparison, the explained heritability is 20 % for MS, 16 % for RA, 23 % for Crohn's disease (CrD), 16 % for ulcerative colitis (UC), and 45 % for T1D) [5, 6].
GWAS in CD: yielding only the tip of the iceberg
GWA studies provide an unbiased approach for identifying genes and pathways involved in a certain phenotype, as they are not based on prior biological knowledge of the genes that they identify. Indeed, GWAS frequently identify genes and/or pathways that were not previously implicated in the phenotype of interest, for example, the unexpected role of the autophagy pathway in IBD. Such an unbiased approach is highly beneficial as it generates new hypotheses that open up new avenues for investigation. Nevertheless, we must be careful in interpreting GWAS findings, as it is sometimes difficult to pinpoint the primary target of the genetic association. It is important to realize that the gene names of disease-associated loci are merely signposts. Often it is difficult to identify the single gene or gene variant conferring risk of, or protection against, a disease, because disease-associated loci often contain multiple genes and potential risk variants. Since individual genetic risk variants are usually common and have only a modest effect on disease risk, and because the cells or tissue in which the disease manifests are difficult to obtain for research purposes, it is also difficult to investigate the consequences of the true causal risk variant. Despite these hurdles, GWAS have uncovered hundreds of loci associated with immune-related disorders, although these may represent only the tip of the iceberg [8, 9, 10]. This wealth of information will serve to formulate hypotheses that can be tested using experimental studies. Moreover, GWAS data can also be subjected to bioinformatic analysis to obtain more details about the tip of the iceberg and to reveal what still remains under the surface (see later sections in this review). To appreciate the complexity of GWAS, it is important to fully grasp the statistics involved. The interested reader can find an extensive description of the analytical methods in a review by Balding.
Here, we will describe how GWAS have contributed to our understanding of the genetics of CD.
The first GWAS for CD was performed in 2007 on a relatively small cohort consisting of 778 CD patients and 1,422 controls, all from the UK. The subjects were tested for association with some 300,000 genetic variants in the human genome (so-called single nucleotide polymorphisms or SNPs) and the 1,500 most strongly associated SNPs were followed up in replication cohorts consisting of 1,643 cases and 3,406 controls. Besides HLA, 13 regions in the genome were identified as harboring genes and genetic variants associated with CD [12, 13, 14]. Interestingly, the majority of the identified regions contained genes controlling immune responses, such as the IL2-IL21 locus on 4q27, thereby suggesting, for the first time, the potential role of IL2, a cytokine important for the homeostasis and function of T cells, and of IL21, a new member of the type 1 cytokine superfamily which regulates many other immune and non-immune cells. This first GWA study also revealed the phenomenon of pleiotropy, i.e., genetic variants associated with CD are also associated with other immune-related diseases. For example, the IL2-IL21 locus is now a well-established disease susceptibility locus for T1D, RA, UC, MS, and systemic lupus erythematosus (SLE) [2, 15, 16, 17, 18, 19, 20, 21, 22].
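The per-SNP test behind such a case-control scan can be illustrated with a basic allelic chi-square test. The sketch below is a minimal, generic example and not necessarily the statistic used in the studies cited above; the function name and the allele counts are made up for illustration.

```python
import math

def allelic_association(case_a, case_b, ctrl_a, ctrl_b):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    case_a / case_b: counts of allele A / allele B in cases
    ctrl_a / ctrl_b: counts of allele A / allele B in controls
    Returns (odds_ratio, chi2, p_value).
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_b
    )
    # Survival function of a chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    return odds_ratio, chi2, p_value

# Hypothetical allele counts (not taken from any study)
odds_ratio, chi2, p_value = allelic_association(300, 700, 200, 800)
print(odds_ratio, chi2, p_value)
```

With these invented counts the odds ratio is about 1.7 and the p value is in the region of 2 × 10⁻⁷, which would look striking for a single test but falls short of the genome-wide significance level used when hundreds of thousands of SNPs are tested.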
A much larger GWAS on CD included more than 4,500 CD patients and nearly 11,000 controls from four different populations (UK, Italy, Finland, the Netherlands) and 550,000 SNPs. After replicating the 131 most significant SNPs in seven follow-up cohorts of European descent, comprising almost 5,000 CD patients and more than 5,500 controls, 13 new regions in the genome were found to be associated with CD, bringing the total number of non-HLA associated loci to 26. The study by Dubois et al. also showed that about 50 % of CD-associated SNPs affect the expression of nearby genes (so-called expression quantitative trait loci or eQTLs), indicating that the mechanism underlying CD is governed by a deregulation of gene expression.
The “resolution” of GWAS heavily depends on the number of samples included. One way to circumvent this limitation is to combine datasets and to perform a meta-analysis, as was done by Dubois et al. Given the pleiotropic nature of the genetics underlying immune-related diseases, it also became possible to conduct cross-disease meta-analyses aimed at identifying additional shared susceptibility loci, as has been successfully demonstrated for CD. Two published GWAS datasets, one on CD and one on RA, were pooled and the results of the primary analysis were replicated using 2,169 CD cases (and 2,255 controls) and 2,845 RA cases (and 4,944 controls). In this meta-analysis, eight SNPs were replicated, including four SNPs mapping to loci that had not previously been associated with either disease (CD247, UBE2L3, DDX6, and UBASH3A) and another four SNPs mapping to loci that had previously only been established in one of the diseases (SH2B3, 8q24.2, STAT4, and TRAF1-C5). The identification of these eight loci, together with six known loci (MMEL1/TNFRSF14, REL, ICOS/CTLA4, IL2/IL21, TNFAIP3, and TAGAP), brought the total number of non-HLA susceptibility loci shared between CD and RA to 14. A similar study was performed for CD and CrD and identified four shared susceptibility loci. Although meta-analysis can help identify shared risk loci, it is important to realize that it is also possible to obtain contradictory data. Sometimes the association with the same locus is more complex and is observed with different SNPs, or with identical SNPs but with the opposite allele. For example, the A allele of SNP rs917997 in IL18RAP is increased in frequency in CD cases, while the same allele is decreased in frequency in T1D patients. This could mean that the allele is protective in one disease and a risk factor in the other.
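How combining datasets increases power, and how opposite allelic effects can cancel out, can be sketched with a fixed-effect inverse-variance meta-analysis. This is a standard textbook approach and is shown here only as an illustration; it is not claimed to be the method used in the studies above, and the effect sizes and standard errors are hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis.

    betas: per-study effect estimates (e.g. log odds ratios for the same allele)
    ses:   per-study standard errors
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return pooled_beta, pooled_se, z, p

# Hypothetical numbers: same direction of effect in both datasets
print(fixed_effect_meta([0.2, 0.2], [0.05, 0.05]))
# Hypothetical numbers: the same allele is risk in one disease, protective in the other
print(fixed_effect_meta([0.2, -0.2], [0.05, 0.05]))
```

In the first call the pooled standard error is smaller than either study's own, so the evidence strengthens. In the second call the pooled effect is zero, which illustrates why a locus with opposite allelic effects in two diseases can be missed by a naive cross-disease analysis.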
One of the most surprising findings from this fine-mapping study was the observation that the PTPRK gene is the causal gene in the THEMIS/PTPRK locus. Immunological publications on the function of the THEMIS gene had suggested that it could be a very interesting candidate risk gene for CD, as it is an important regulator of thymic T cell selection. This observation suggested an important role for the thymus, an attractive theory given the lack of oral tolerance in CD. The literature on the PTPRK gene is limited, but knock-out of the Ptprk gene in rats leads to a T helper (Th) cell deficiency. This example shows that GWAS results can easily be misinterpreted if attractive candidates are chosen without performing further validation. Immunochip analysis also identified 147 non-CD autoimmune disease loci with intermediate p values (in GWAS only SNPs with P < 5 × 10⁻⁸ are considered true associations, as they have reached “genome-wide significance”). It cannot be ruled out that these SNPs play a role in the disease process and that the study was simply underpowered to prove their involvement unequivocally, suggesting that there might be dozens more genes contributing to CD.
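The genome-wide significance level of 5 × 10⁻⁸ is conventionally motivated as a Bonferroni correction of a 0.05 error rate for roughly one million independent tests. The short sketch below shows that arithmetic and sorts p values into three bands; the lower bound of the "intermediate" band (1 × 10⁻⁵) and the function name are assumptions made purely for illustration and are not taken from the Immunochip study.

```python
# Bonferroni correction: 0.05 spread over ~1 million independent tests
GENOME_WIDE_ALPHA = 0.05 / 1_000_000

def classify_p_value(p, suggestive=1e-5):
    """Label a p value; `suggestive` is an assumed, illustrative cut-off."""
    if p < GENOME_WIDE_ALPHA:
        return "genome-wide significant"
    if p < suggestive:
        return "intermediate"
    return "not significant"

for p in (1e-9, 1e-6, 0.01):  # hypothetical p values
    print(p, classify_p_value(p))
```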
Another approach for fine-mapping is imputation [29, 30]. Imputation is an in silico process in which the alleles of non-genotyped SNPs in an individual are inferred (though not directly assayed) based on the haplotype structure present in large reference datasets, such as the ones provided by the 1000 Genomes Project (2010) and the International HapMap project [31, 32, 33]. A haplotype is the combination of alleles at adjacent locations (loci) on the chromosome that are transmitted together. After imputation, each dataset typically contains information on 2.5–4 million SNP variants per individual, including low-frequency variants that are not covered on a typical GWAS array. Subsequent association analysis on imputed genotypes may narrow down the region of association and help pinpoint the causative variant. As imputation is merely an in silico prediction of unknown genotypes based on the haplotype structure of a reference population, sufficient quality control measures are needed to exclude badly imputed SNPs, and the predicted genotypes then need to be validated by other genotyping techniques or direct sequencing.
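The principle of imputation can be shown with a deliberately simplified toy example. The sketch below assumes phased haplotypes written as strings, matches a study haplotype against a small reference panel at the genotyped SNPs, and fills in untyped SNPs (marked `?`) by majority vote. Real imputation software relies on probabilistic haplotype models rather than exact matching; the function name and all sequences are hypothetical.

```python
from collections import Counter

def impute_haplotype(observed, reference):
    """Fill in '?' positions of `observed` using matching reference haplotypes.

    Returns (imputed_haplotype, per_position_confidence), where confidence is
    the fraction of matching reference haplotypes that carry the chosen allele.
    """
    # Reference haplotypes that agree with the study haplotype at every typed SNP
    matches = [
        ref for ref in reference
        if all(o == "?" or o == r for o, r in zip(observed, ref))
    ]
    imputed, confidence = [], []
    for i, allele in enumerate(observed):
        if allele != "?":
            imputed.append(allele)       # directly genotyped
            confidence.append(1.0)
        elif matches:
            best, count = Counter(ref[i] for ref in matches).most_common(1)[0]
            imputed.append(best)         # most frequent allele among matches
            confidence.append(count / len(matches))
        else:
            imputed.append("?")          # no matching haplotype: leave missing
            confidence.append(0.0)
    return "".join(imputed), confidence

reference_panel = ["ACGT", "ACGT", "ATGA", "GTCA"]  # hypothetical panel
print(impute_haplotype("A?GT", reference_panel))
print(impute_haplotype("A?G?", reference_panel))
```

The confidence value plays the role of the quality-control measure mentioned above: positions imputed with low confidence would be excluded before association testing.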
Genetic architecture of celiac disease
Now that a plethora of CD susceptibility factors has been identified, the challenge is to pinpoint the causal variants from each locus, and to prove that these causal variants affect the function of tissues and cell types involved in CD. Meeting this challenge requires a multidisciplinary approach, involving the generation and integration of bioinformatic, genetic, immunological and cell biological experimental data and clinical data. Below we will discuss the strategies that can be employed to meet this challenge, while focusing on the non-HLA CD susceptibility loci.
Expression QTL analysis can help to identify the causative gene in a locus with multiple candidates
It is difficult to identify the causal gene in a disease-associated locus that contains multiple candidate genes. The problem is compounded by the fact that the disease-associated SNP may not itself be the causal SNP, but may merely be in strong linkage disequilibrium (LD) with the true causal variant. An elegant strategy that can be applied to narrow down the causal gene in a locus is to correlate genotypes with expression data. This approach has been coined expression QTL (eQTL) analysis [45, 46, 47]. Although eQTL analysis does not prove that a gene is the causal one in the locus, it can help in prioritizing genes for follow-up studies.
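Correlating genotypes with expression can be sketched as a simple linear regression of a gene's expression level on the number of risk alleles carried (0, 1, or 2). The example below is a minimal illustration under that additive assumption, with invented expression values; published eQTL studies typically use more elaborate models and corrections for multiple testing.

```python
def eqtl_regression(dosages, expression):
    """Least-squares fit of expression on allele dosage (0, 1 or 2).

    Returns (slope, intercept, r_squared). The slope is the estimated change
    in expression per additional copy of the counted allele.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical data: six individuals, expression in arbitrary log2 units
dosages = [0, 0, 1, 1, 2, 2]
expression = [10.0, 10.2, 8.9, 9.1, 8.0, 7.8]
print(eqtl_regression(dosages, expression))
```

A clearly non-zero slope for one gene in a locus, and not for its neighbors, is the kind of evidence that helps prioritize that gene for follow-up.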
eQTLs come in two flavors: (1) cis-eQTLs, in which SNPs affect the expression of nearby genes, and (2) trans-eQTLs, in which SNPs affect the expression of genes far away on the same chromosome or even on another chromosome. Dubois et al. used a dataset consisting of genome-wide gene expression data and genome-wide SNP data from 1,469 human primary blood leukocyte samples to perform an eQTL analysis in CD. They showed that 20 out of the 38 (53 %) non-HLA CD susceptibility loci they investigated displayed significant eQTL effects. The most impressive eQTL effect was found for SNP rs917997 in the IL18RAP gene (P = 7.4 × 10⁻⁸⁷), causing a 9-fold difference in IL18RAP expression between carriers of two wild-type alleles and carriers of two risk alleles. This helped to pinpoint IL18RAP as the likely causal gene in a locus also harboring IL18R1, IL1RL1, and IL1RL2, since the latter three did not display a cis-eQTL effect. Altogether these findings indicate that the mechanism underlying CD is governed by a deregulation of gene expression. Other immune-related diseases show similar numbers of eQTLs for disease-associated SNPs, suggesting that this is a more general phenomenon: for example, 39 out of 71 CrD loci (55 %) and 32 out of 53 T1D loci (60 %) show an eQTL effect.
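The distinction between the two flavors can be written down as a simple rule. The sketch below assumes a window of 1 Mb between SNP and gene as the cut-off for "nearby"; this value and the function name are assumptions for illustration only, and the positions in the example are invented.

```python
def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_pos, cis_window=1_000_000):
    """Return 'cis' if the SNP lies within `cis_window` base pairs of the gene
    on the same chromosome, otherwise 'trans'. The window size is an assumed,
    illustrative threshold."""
    if snp_chrom == gene_chrom and abs(snp_pos - gene_pos) <= cis_window:
        return "cis"
    return "trans"

# Hypothetical coordinates
print(classify_eqtl("2", 102_000_000, "2", 102_300_000))  # same chromosome, close by
print(classify_eqtl("2", 102_000_000, "2", 180_000_000))  # same chromosome, far away
print(classify_eqtl("2", 102_000_000, "6", 32_000_000))   # different chromosome
```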
The identification of eQTL effects in trans (trans-eQTLs) is much more difficult, presumably because these are more tissue specific and cell specific. Trans-eQTLs are of interest because they implicate biological processes by linking disease SNPs to the expression pattern of many genes, thereby potentially revealing disease-associated pathways. As an example, Fehrmann et al. described the trans-eQTL effects of 1,167 published trait- or disease-related SNPs on gene expression in peripheral blood mononuclear cells (PBMCs) of 1,469 unrelated individuals. Trans-eQTL effects were observed on 113 genes, of which 46 could be replicated in a dataset obtained from monocytes of 1,490 different individuals, and 18 could be replicated in a dataset generated from subcutaneous adipose, visceral adipose, liver and muscle tissue from the same replication cohort. In addition, they identified 18 unlinked SNP pairs, associated with a single phenotype and affecting the regulation of the same trans-gene. The fact that single genes are regulated in trans by multiple SNPs could indicate the importance of the trans-gene in the disease mechanism. In the same study, they also found that HLA SNPs are 10-fold enriched for trans-eQTL effects.
Applying pathway analysis to zoom in on gene function and disease mechanisms
Although the GWAS approach has its shortcomings, for instance, it cannot pinpoint the causal gene in all loci, the approaches described above can help suggest causal candidate genes. A significant subset of the CD susceptibility loci can be associated with T cell biology, including REL, TNFAIP3, THEMIS/PTPRK, ETS1, RUNX3, TLR7/TLR8, BACH2, and IRF4, but it is likely that other cell types are affected as well. Yet another strategy that can be applied to GWAS results is pathway analysis, and quite a number of pathway analysis tools are now publicly available [51, 52]. In some of the pathway analysis approaches, human datasets have successfully been intersected with results obtained from model organisms such as yeast, worms and flies, to infer functional and physical interaction networks. Pathway analysis algorithms predict pathways based on connections between the genes in the query list that can be distilled from literature co-citation, gene ontology terms, co-expression, protein-protein interaction data, possession of common regulatory motifs or domains, tissue-specific co-expression, subcellular co-localization, and phenotypic profiling. All of these sources of information have been shown to provide useful data on biological function. Using these data and insights, systems biology approaches can then be applied to unravel the role of the immune system in CD. While these approaches have so far been less often applied in mammalian systems, the recent availability of relevant datasets in humans and mice will facilitate such strategies.
In a recent review, Wang et al. outlined the development of pathway-based approaches for GWAS and discussed their practical use and caveats. Many of the available tools examine whether a group of related genes in the same functional pathway are jointly associated with a trait of interest. Gene Relationships Among Implicated Loci (GRAIL) is a computational tool that takes a list of GWAS regions and predicts the likely causal gene in each locus using information from 250,000 PubMed abstracts. GRAIL can predict new loci and was successfully applied to RA, where it identified CD28, PRDM1, and CD2/CD58 as involved in the disease. Functional relationships between genes and their products can also be obtained from the Kyoto Encyclopedia of Genes and Genomes, the Biomolecular Interaction Network Database, the Human Protein Reference Database, the Gene Ontology (GO) Database, predicted (tissue-specific) phenome-interactome/expression networks [61, 62], the CCSB Interactome Database, and microarray co-expression datasets (GEMMA; http://www.chibi.ubc.ca/Gemma).
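One common way to test whether genes from associated loci cluster in a pathway is an over-representation test based on the hypergeometric distribution. The sketch below illustrates that generic idea only; it is not the text-mining method used by GRAIL, and the gene counts in the example are hypothetical.

```python
from math import comb

def enrichment_p(n_genome, n_pathway, n_query, n_overlap):
    """One-sided hypergeometric p value for pathway over-representation.

    n_genome:  number of genes in the background set
    n_pathway: number of background genes annotated to the pathway
    n_query:   number of genes in the query list (e.g. genes in GWAS loci)
    n_overlap: number of query genes that belong to the pathway
    Returns P(overlap >= n_overlap) if the query genes were drawn at random.
    """
    total = comb(n_genome, n_query)
    upper = min(n_pathway, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_genome - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 40 query genes, 8 of which fall in a 200-gene pathway
print(enrichment_p(20_000, 200, 40, 8))
```

Because such a test depends entirely on which genes are annotated to a pathway, it inherits the annotation biases of the underlying databases.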
It has to be kept in mind that pathway analysis relies on databases that contain experimental data, and that the quality of these data is not equally high for every dataset included. Moreover, these tools favor well-defined pathways, and lesser-studied genes may not be taken into account, making it more difficult to identify lesser-known genes and pathways involved in disease etiology. Despite these shortcomings, pathway analysis approaches are becoming a mainstay in medical research and they have already demonstrated their usefulness in generating new hypotheses that can subsequently be tested.
We thank members of the Department of Genetics of the University Medical Center in Groningen for fruitful discussions. We would also like to thank Claudia M. González Arévalo and Harm-Jan Westra for help with the graphics, and Jackie Senior for editing the final text. This work was made possible by grants to CW from the Celiac Disease Consortium, an innovative cluster approved by the Netherlands Genomics Initiative and partially funded by the Dutch Government (BSIK03009), from the Netherlands Organization for Scientific Research (NWO, VICI grant 918.66.620), and from the Dutch Digestive Diseases Foundation (MLDS, WO 11-30).
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.