Human genetics - linking inherited variation in DNA sequence with traits such as susceptibility to disease - provides prima facie evidence that a gene and a pathway are associated with a disease. The most recent wave of genomic technology has allowed human genomes to be scanned for variant DNA sequences (or alleles) in many people to determine which alleles are associated with a particular disease or phenotype of interest. Termed genome-wide association studies, or GWASs, this approach has identified hundreds of alleles that are associated with a variety of human traits [1, 2]. By most accounts, the GWAS approach has been very successful at identifying new regions of the genome (or loci) that are important in disease, even though the effect sizes of most alleles are modest.

The GWAS approach has been particularly successful at uncovering risk alleles for autoimmune diseases. Collectively, autoimmune diseases are common, affecting more than 5% of the adult population [3]. These diseases include rheumatoid arthritis (RA), type 1 diabetes (T1D), inflammatory bowel disease (IBD), systemic lupus erythematosus (SLE), multiple sclerosis (MS), psoriasis and celiac disease (among others). RA is a chronic inflammatory disease that destroys free moving joints. T1D is a form of diabetes that results from the destruction of insulin-producing beta cells of the pancreas. IBD is a group of inflammatory conditions of the colon and small intestine; the two major types are Crohn's disease and ulcerative colitis. In SLE, the immune system attacks a wide variety of organs, including the heart, joints, skin, lungs, blood vessels, liver, kidneys and nervous system. MS is an autoimmune disease in which the fatty myelin sheaths around the axons of the brain and spinal cord are damaged, leading to a broad spectrum of signs and symptoms. Psoriasis is a chronic disease in which the skin develops red, scaly patches, which is the result of areas of inflammation and excessive skin production. Celiac disease is an autoimmune disorder of the small intestine caused by a reaction to storage proteins (called glutens) found in cereal grains; the ensuing excessive immune reaction leads to an attack on the intestinal villi and tissue damage, resulting in malabsorption of nutrients.

So far, approximately 150 loci have been identified that increase risk of these autoimmune diseases [4-14]. For each disease, the strongest genetic risk factors reside within the major histocompatibility complex (MHC) region on chromosome 6 [15]. Most associated alleles in other regions are common in the general population, but increase the disease risk by only 10 to 20% (corresponding to an odds ratio (OR) of 1.10 to 1.20 per copy of the risk allele). (The OR is a measure of the strength of association; it refers to the ratio of the odds of an event occurring in one group (such as cases) to the odds of it occurring in another group (such as controls).) For any given autoimmune disease, the known genetic risk alleles explain between 10 and 20% of variance in disease risk, whereas more than 50% of disease risk is estimated to be heritable. The remaining 30% or so of unexplained genetic disease risk is termed the missing heritability.
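As a rough illustration of the odds ratio defined above, the following Python sketch computes a per-allele OR from a 2 × 2 table of allele counts. The counts and the function name are hypothetical, chosen only so that the result falls within the 1.10 to 1.20 range typical of common risk alleles; they are not taken from any study.

```python
def odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds of carrying the risk allele in cases divided by the odds in controls."""
    return (case_risk / case_other) / (control_risk / control_other)

# Hypothetical allele counts: 1,000 cases and 1,000 controls,
# two chromosomes per person (2,000 chromosomes per group).
case_risk, case_other = 660, 1340
control_risk, control_other = 600, 1400

or_value = odds_ratio(case_risk, case_other, control_risk, control_other)
print(f"OR = {or_value:.2f}")
```

With these made-up counts the risk allele frequency is 33% in cases and 30% in controls, which corresponds to an OR of roughly 1.15.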

The challenges now are, first, to find the causal mutation responsible for the signal of association; second, to understand which gene is disrupted by the causal mutation and how it is disrupted (that is, whether the mutation results in gain of function, loss of function, or a new function altogether); third, to understand which cell type and biological pathways are altered by these mutations; and finally to find additional mutations that explain the missing heritability [16]. The next wave of genomic technology - next-generation sequencing - will be a powerful ally in this effort. In particular, next-generation sequencing will help localize the causal mutation, as well as help identify rare alleles that confer risk of autoimmune disease.

Thus, an important question remains: what is the most appropriate scientific approach to understand function of risk alleles discovered in human genetics research? Is the mouse the most appropriate model organism, or do these genetic discoveries provide new resources to enable functional studies directly in human immune cells?

Here, I discuss the confluence of events that create a unique opportunity to use human subjects as the 'model organism' for the study of autoimmune disease pathogenesis. In addition to GWASs and next-generation sequencing, registries of blood draws from healthy, consenting human volunteers enable functional studies of genetic variants in a wide range of primary human immune cells, and human stem cell technology has advanced to the point at which induced pluripotent stem (iPS) cells can be derived from patients with specific mutations and differentiated into diverse immune lineages. These resources should allow investigators to understand the altered cellular state in diseases that are uniquely human, which should ultimately lead to new therapeutics to treat or prevent the devastating consequences of autoimmune disease.

Common SNPs and risk of autoimmune disease

In general, 'common' variants are those present at a frequency of over 1% in any one continental population (such as Europeans, Asians and Africans), whereas 'rare' variants are those present at a frequency of less than 1% in these populations [17]. This simple categorical distinction has been made in order to frame the genetic approach to discovering and testing DNA variants for their role in disease. For common variants, it is possible to screen a reference population to identify a catalog of variants (the discovery phase), and then test these variants in case-control collections using high-throughput genotyping technologies (the testing phase). A variety of resources have been developed to catalog common single nucleotide polymorphisms (SNPs), including the International HapMap Project [17, 18]. More recently, data from the 1000 Genomes Project [19] have begun to be used to catalog variants in the 1% frequency range.

In order to test whether these common SNPs are associated with risk of disease, commercial 'SNP chips' or arrays have been developed that capture most, although not all, common variation in the genome. These genotyping arrays can genotype hundreds of thousands of SNPs in a single experiment, at a cost of several hundred US dollars per sample. Contemporary GWASs use these arrays to measure the frequency of alleles in cases compared with controls. If the difference in allele frequency reaches a stringent level of statistical significance that corrects for the fact that there are about 1,000,000 independent common SNPs in the human genome (this significance level is about P < 5 × 10⁻⁸), then the allele is said to be 'associated' with disease.
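The significance level quoted above can be recovered with simple arithmetic if one assumes a Bonferroni-style correction of a conventional 0.05 threshold across about 1,000,000 independent tests. The Python sketch below shows that calculation together with a basic allelic chi-square test on a 2 × 2 table of allele counts. The choice of test, the function names and the counts are assumptions made for illustration; actual GWAS pipelines typically use more elaborate models.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns the chi-square statistic and its P value.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_b
    )
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

threshold = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical counts: 2,000 cases and 2,000 controls (4,000 chromosomes each),
# allele frequency 35% in cases versus 30% in controls.
chi2, p_value = allelic_chi_square(1400, 2600, 1200, 2800)

print(f"threshold = {threshold:.1e}")
print(f"chi2 = {chi2:.2f}, P = {p_value:.1e}")
print("genome-wide significant:", p_value < threshold)
```

In this made-up example a five percentage point difference in allele frequency gives a P value on the order of 10⁻⁶, which would look convincing for a single test but does not reach the genome-wide threshold. This is one reason GWASs require large sample sizes.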

There are approximately 10 million common SNPs in the human genome. A fundamental challenge in human genetics is to systematically test each of these 10 million common SNPs for its role in disease. Contemporary GWASs test several hundred thousand SNPs across the entire human genome, most of which are common (minor allele frequency over 5%) in the general, healthy population. To test the remaining over 9 million common SNPs, the GWAS approach relies on the correlation structure of nearby SNPs. That is, nine out of ten SNPs are highly correlated, and testing one SNP serves to tag the remaining nine nearby SNPs. This concept is known as linkage disequilibrium (LD).
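The correlation between two SNPs is commonly summarized as r², which can be computed from allele and haplotype frequencies. The Python sketch below is a minimal illustration under the assumption that haplotype frequencies are already known (in practice they are estimated from genotype data). The function name and the frequencies are hypothetical.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Two SNPs whose alleles almost always travel together on the same haplotype.
r2_linked = ld_r_squared(p_ab=0.28, p_a=0.30, p_b=0.30)

# Two SNPs whose alleles are inherited independently (p_ab = p_a * p_b).
r2_independent = ld_r_squared(p_ab=0.09, p_a=0.30, p_b=0.30)

print(f"linked pair:      r^2 = {r2_linked:.2f}")
print(f"independent pair: r^2 = {r2_independent:.2f}")
```

In the first pair r² is above 0.8, so genotyping one SNP captures most of the information about the other; in the second pair r² is essentially zero and each SNP would need to be tested separately.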

The underlying rationale for the GWAS approach is rooted firmly in population genetics, as most of the differences between any two chromosomes are due to common SNPs [20]. On the basis of the hypothesis that disease alleles reflect the allelic spectrum of diseases in the general population, the risk of common diseases will be attributable in part to allelic variants that are also common.

GWASs have discovered about 150 loci that harbor SNPs associated with risk of autoimmune diseases. Several of the earliest GWASs that successfully identified common risk alleles were done in autoimmune diseases. Crohn's disease is an illustrative example. Before GWASs, only two loci outside the MHC were known to be associated with Crohn's disease risk [21]. In 2006, a GWAS of about 1,000 case-control samples identified a coding variant in the interleukin 23 receptor (IL23R) gene locus [22]. The landmark Wellcome Trust Case Control Consortium GWAS, published in 2007, included three autoimmune diseases (of the seven diseases studied): Crohn's disease, T1D and RA [23]. Since these initial GWASs, over 30 Crohn's disease risk loci [7], over 40 T1D risk loci [6] and over 25 RA risk loci have been discovered [24].

From these GWASs, an important theme has emerged: the overlap among the loci that confer risk of autoimmune disease. In 2008, Smyth and colleagues [9] studied the overlap between celiac disease and T1D. The study [9] found that nearly half of the about 30 risk loci contributed to both diseases, whereas the others seemed to be disease-specific. Other studies have compared and contrasted confirmed risk alleles for a variety of autoimmune diseases [9, 25-27]. There is clear overlap for many of the known risk alleles, consistent with epidemiological data of disease clustering within families [28]. A partial list of loci associated with multiple autoimmune diseases is shown in Table 1.

Table 1 Loci associated with multiple autoimmune diseases

Missing heritability: next-generation sequencing and the role of rare SNPs

Although the number of loci associated with autoimmune disease is impressive, these loci cannot explain a sizeable fraction of disease risk. In fact, outside the MHC, common alleles can only explain 5 to 10% of disease risk associated with autoimmune disease. Considering that family studies have shown that more than 50% of autoimmune disease risk is thought to be genetic, the question arises as to why so much of the heritability is apparently unexplained by initial GWAS findings. One of the most frequently cited explanations for 'missing heritability' is that rare SNPs contribute substantially to disease risk, and contemporary GWAS arrays do not adequately capture rare variants [16].

There are two ways to test rare variants systematically for association with disease. First, it is possible to catalog low-frequency variants - those variants present in approximately 0.5 to 5% of control chromosomes - in a manner analogous to common variants. The only difference is that a greater number of subjects need to be included in the discovery effort. This is the main premise behind the 1000 Genomes Project [19]. Once discovered and catalogued, these low-frequency variants could be genotyped in a high-throughput manner using genotyping arrays.

The second approach is to couple the discovery and testing phases into a single experiment. That is, direct sequencing is done in case-control collections themselves, generating an unbiased catalog of DNA variants that are then tested for association with disease.

Until recently, direct sequencing in large patient samples was cost prohibitive. Next-generation sequencing has been developed to sequence large regions of DNA - with the ultimate goal of sequencing the complete genome - in a high-throughput and cost-effective manner. In the near future, next-generation sequencing will probably be the technical method of choice for conducting GWASs.

From associated SNP to causal allele and causal gene

An important promise of human genetics is that GWASs offer an unbiased approach to discovering new pathways that cause disease. Towards this end, a major challenge is to take the expanding list of disease risk alleles and understand the effect on gene function. The first step is to identify which gene near the associated SNP has its function affected by the underlying causal mutation (which is rarely known). This step is critical, as the region of LD surrounding the associated SNP often contains more than one gene (although often there is one likely candidate gene from the known biology). A region of LD includes neighboring sequence in which a group of SNPs are highly correlated (for example, at a correlation coefficient of r² > 0.80). Moreover, it is conceivable that the causal mutation exerts its effect at a distance (for example, by altering gene expression) or that the causal mutation is rare in the general population and located some distance from the associated SNP [29].

As shown in Figure 1, there are at least three general approaches to get from associated SNP to causal gene (and causal mutation). First, fine-mapping of the region of LD is performed using resequencing and dense genotyping. An allele is considered causal if it is predicted to alter function and if direct experimentation demonstrates altered function. An intriguing result from GWASs is that most associated SNPs lie outside coding regions, and most of the causal mutations probably also fall outside coding regions. It is likely that many causal mutations affect gene expression or mRNA splicing.

Figure 1

From associated SNP to causal gene/mutation. There are at least three ways to go from an associated SNP in a GWAS to the causal mutation(s) and causal gene. The first is to perform dense genotyping to identify the set of common SNPs that yield the strongest signal of association, followed by hypothesis-driven functional studies. The second is to perform deep re-sequencing to search for rare mutations that are independent of the common mutation and that alter protein function. The third is to use bioinformatics approaches to establish connections among genes across associated loci.

One of the best examples was fine-mapping and functional studies of IRF5, a gene associated with SLE and other autoimmune diseases [30, 31]. IRF5 encodes a member of the interferon regulatory factor (IRF) family, a group of transcription factors with diverse roles, including virus-mediated activation of interferon and modulation of cell growth, differentiation, apoptosis and immune system activity. Studies have revealed three functional alleles of IRF5: an exon 1 splice site variant, a 30-bp in-frame insertion/deletion variant of exon 6, and a variant in a conserved poly(A)+ signal sequence that alters the length of the 3' untranslated region and stability of IRF5 mRNAs [30]. Haplotypes of these three variants define at least three distinct levels of risk to SLE. There is an approximately twofold increase in the level of risk between carriers of the highest and lowest risk haplotypes.

Second, candidate genes from a region of LD can be resequenced to search for independent, rare protein-coding mutations. The underlying hypothesis is that a true causal gene will harbor multiple risk alleles; at least one of these might be common (and identified by GWAS), whereas many others will be rare. Precedent for this hypothesis comes from studies of Mendelian disorders, for which disease can be caused by many different mutations to the same gene (genetic heterogeneity). In a study published in 2009 [32], the coding exons of six genes identified by GWASs of T1D were resequenced to search for independent rare mutations. Two rare SNPs in the interferon-induced helicase C domain-containing protein 1 (IFIH1) gene were identified that conferred protection from T1D. IFIH1 is a cytoplasmic protein that recognizes RNA of certain viruses and mediates immune activation. Following infection, the IFIH1 protein senses the presence of viral RNA in the cytoplasm, triggers activation of nuclear factor (NF)-κB and IRF pathways and induces antiviral IFN-β response. The non-synonymous SNP with the strongest association, rs35667974 (which causes the amino acid substitution Ile923Val), was observed on an estimated 3 out of 960 case chromosomes but 24 out of 960 control chromosomes (P = 0.00004); another SNP, rs35337543 (which affects a splice donor site), was observed on 7 case chromosomes and 23 control chromosomes (P = 0.005). Both SNPs were genotyped in more than 20,000 additional case-control samples: rs35667974 was present in about 1% of cases and 2% of controls (P = 2.1 × 10⁻¹⁶) and rs35337543 in 1% of cases versus 1.5% of controls (P = 1.4 × 10⁻⁴). Both mutations are predicted to be loss-of-function mutations, although why these mutations influence risk of T1D remains unknown.
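To give a feel for how such rare-variant counts translate into a protective effect, the Python sketch below computes an odds ratio and a two-sided Fisher's exact test for the resequencing counts quoted above. The text does not state which test the original study used, so Fisher's exact test is an assumption here, as is the denominator of 960 chromosomes per group for rs35337543. The function names are hypothetical, and the P values this sketch produces should not be expected to match the published ones exactly.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )

def carrier_odds_ratio(case_carriers, control_carriers, n_per_group):
    """Odds of the rare allele on case chromosomes relative to control chromosomes."""
    case_odds = case_carriers / (n_per_group - case_carriers)
    control_odds = control_carriers / (n_per_group - control_carriers)
    return case_odds / control_odds

N = 960  # chromosomes per group (assumed for both SNPs)

# rs35667974: 3 case chromosomes versus 24 control chromosomes
or_ile923val = carrier_odds_ratio(3, 24, N)
p_ile923val = fisher_exact_two_sided(3, N - 3, 24, N - 24)

# rs35337543: 7 case chromosomes versus 23 control chromosomes
or_splice = carrier_odds_ratio(7, 23, N)
p_splice = fisher_exact_two_sided(7, N - 7, 23, N - 23)

print(f"rs35667974: OR = {or_ile923val:.2f}, P = {p_ile923val:.1e}")
print(f"rs35337543: OR = {or_splice:.2f}, P = {p_splice:.1e}")
```

An odds ratio well below 1 for both variants reflects the fact that the rare alleles are seen less often on case chromosomes than on control chromosomes, which is what 'conferring protection' means in this context.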

The third approach is less direct, but nonetheless very powerful, especially when there are many loci associated with risk of disease. The underlying hypothesis is that there are a limited number of biological pathways that are altered to confer risk of disease and that true causal genes will be restricted to those specific pathways. Examples of such pathways include known signaling pathways (such as the NF-κB pathway and risk of RA [33]) or catabolic pathways (such as autophagy and risk of Crohn's disease [20]). The challenge of this computational approach is to define categories of pathways, as our understanding of many biological processes is incomplete. One successful approach has been to use information contained in PubMed abstracts to establish connections between gene loci [34]. This approach has been used to identify putative causal genes for RA and celiac disease [5, 13]. In the RA study, three loci were identified that contained the genes CD28, CD2/CD58 and PRDM1, respectively [5]. Both CD2 and CD28 are co-stimulatory molecules on the surface of T cells. PRDM1 (also known as BLIMP-1) is a transcription factor that regulates terminal differentiation of B cells into immunoglobulin-secreting plasma cells. Once these connections are established among risk loci, direct experimentation is still required to prove the pathways are critical to disease.

Resources to validate the biological effects of causal mutations

Once the causal gene and causal mutation(s) have been identified, the next major challenge is to understand the underlying biological pathways that lead to autoimmune disease. New resources now make it possible to study the effects of mutations linked to autoimmune disease directly in relevant human tissue.

Registries have now been established at academic medical centers to study the functional consequences of common genetic mutations in blood cells from healthy control subjects [35]. Human immune cells (such as B and T lymphocytes) are easily accessible through a simple blood draw. These immune cells are of direct relevance to pathogenesis of autoimmune diseases, as indicated not only by recent genetic studies but also by previous studies in patients with autoimmune diseases [36]. Human immune cells derived from healthy control subjects have been used successfully to gain insight into function of common mutations at several autoimmune genes. A missense mutation in the PTPN22 gene is associated with several autoimmune diseases. PTPN22 encodes a protein tyrosine phosphatase that is expressed in lymphoid tissues and implicated in T-cell activation [37]. Functional studies in T cells derived from healthy human participants have shown that the PTPN22 risk allele alters secretion of IL2 from T cells stimulated through the T-cell receptor [38]. Other autoimmune risk alleles have been studied in a similar manner: a common multiple sclerosis risk mutation at CD58 can explain about 40% of the variance of CD58 cell surface expression on peripheral blood mononuclear cells (PBMCs) [39]; and a common T1D mutation in IL2RA alters IL2RA cell surface expression on CD4+ memory T cells [40].

Another new approach is to generate iPS cells from patients who carry specific genetic mutations. iPS cells were first described in 2006 [41], and several studies have since shown that they can be derived from patients with Mendelian disorders [42]. By definition, iPS cells are pluripotent and can be differentiated into any human cell type. Specific protocols are required to direct differentiation into a specific cell lineage. In the case of immune lineages, protocols have been developed to differentiate human embryonic stem (ES) cells into B cells, T cells, natural killer cells, and other immune lineages [43-50]. Because of the similarities between ES and iPS cells, differentiation protocols developed in ES cells should be applicable to differentiation of iPS cells into these same immune lineages.

Whether iPS cells derived from patients with autoimmune disease will be useful for functional studies of human genetic mutations is a hypothesis that needs to be rigorously tested. Human iPS cells offer several theoretical advantages over primary human immune cells derived from healthy patients. First, although many immune lineages can be isolated from peripheral blood, many reside within lymph nodes and other privileged sites not accessible through the blood. Moreover, it is impractical to isolate more than a few immune lineages in the amount of blood drawn from a single individual at a single point in time. Second, in studies of primary human immune cells, it is important to investigate carriers and non-carriers of mutations on the same day to minimize technical variability. iPS cells have the theoretical advantage of repeated measurements under a set of controlled conditions. Finally, primary human cells have a limited lifespan in culture. As a consequence, it is difficult to manipulate primary cells with transfections and other cellular perturbations.

Most genetic discoveries have concerned the risk of disease overall, rather than relevant subsets of disease; this applies not just to autoimmunity but also to other diseases. As a consequence, a new challenge is to correlate genotype with clinically relevant phenotypes, such as response to therapy and disease severity. For genotype-phenotype correlation studies, the major bottleneck is setting up large registries of patients with biospecimens for genomic studies and detailed clinical data. Traditional patient registries and clinical trials - the workhorse for sample collection over the past decades - are unlikely to achieve the size required to obtain thousands of autoimmune patient samples for these studies. New approaches - next-generation registries - will be required to break this bottleneck. In theory, it should be possible to collect data as part of routine patient care. Increased use of electronic medical records [51] and new approaches to mining clinical data from such records [52] is one exciting approach to expanding sample collections.

Contemporary GWASs of common variants have identified approximately 150 loci that confer risk of common autoimmune diseases. Once the causal genes and causal mutations have been identified, the next challenges will be to understand the underlying biological pathways and to correlate genotype with clinically relevant phenotypes. New resources are now available to enable these translational immunology studies in humans. Over the next few years, great strides should be made towards accomplishing these ambitious yet attainable goals.