Introduction

The major histocompatibility complex (MHC) locus, also known as the human leukocyte antigen (HLA) locus, spans around 4 Mbp on the short arm of chromosome 6 (6p21.3; Box 1). Molecules encoded by this region are involved in antigen presentation, inflammation regulation, the complement system, and the innate and adaptive immune responses, indicating the MHC’s importance in immune-mediated, autoimmune, and infectious diseases [1]. Over the past 50 years, polymorphisms in the MHC locus have been shown to influence many critical biological traits and individuals’ susceptibility to complex, autoimmune, and infectious diseases (Boxes 2 and 3). In addition to autoimmune and inflammatory diseases, the MHC has recently been found to play a role in some neurological disorders [2,3,4,5,6], implicating autoimmune components in these diseases.

The genetic structure of the MHC is characterized by high levels of linkage disequilibrium (LD) compared to the rest of the genome, which means there are technical challenges in identifying MHC single nucleotide polymorphisms (SNPs), alleles, and amino acids. However, the recent availability of dense genotyping platforms, such as the custom-made Illumina Infinium SNP chip (Immunochip) [7], and of MHC reference panels has helped to fine-map the locus, improving our understanding of its disease associations and our ability to identify functional variants.

In this review, we discuss recent advances in mapping susceptibility variants in the MHC, using autoimmune and infectious diseases as examples (Boxes 2 and 3). We also discuss the relationships between the MHC variants involved in both autoimmune and infectious diseases and offer insights into the MHC-associated immune responses underlying disease onset and pathogenesis. Finally, we discuss future directions for studying genetic variation in the MHC and how learning about the variation at this locus will aid understanding of disease pathogenesis.

Advances in mapping susceptibility variants in the MHC locus

Several computational and empirical challenges complicate the mapping of MHC susceptibility variants. One fundamental challenge is that the MHC has many sequence and structural variations [8], which differ between populations and complicate haplotype inference. Another is that high and extensive LD in the locus makes it difficult to identify causal and independent loci. Non-additive allelic effects in the MHC, and epistatic effects between the MHC and other loci, can also complicate inference of the underlying haplotype structure and disease susceptibility variants.
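To make the LD challenge concrete, the following minimal sketch computes the standard pairwise r² statistic between two biallelic SNPs from phased haplotypes. The haplotypes are invented toy data and the function name `ld_r2` is our own; it illustrates the statistic and is not taken from any of the cited pipelines.

```python
def ld_r2(haplotypes, i, j):
    """Return r^2 between SNP columns i and j of phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n   # frequency of allele 1 at SNP i
    p_b = sum(h[j] for h in haplotypes) / n   # frequency of allele 1 at SNP j
    # Frequency of haplotypes carrying allele 1 at both SNPs.
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return d * d / denom


# Toy phased haplotypes (hypothetical): rows are chromosomes, columns are SNPs.
TOY_HAPLOTYPES = [
    (1, 1, 0),
    (1, 1, 1),
    (0, 0, 0),
    (0, 0, 1),
]
```

In this toy panel the first two SNPs always travel together, so their r² is at its maximum and an association at one cannot be separated statistically from the other, whereas the first and third SNPs are uncorrelated.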

In recent years, large volumes of sequencing data have made it possible to impute MHC variation on a wide scale, thereby improving our understanding of variability at this locus and of the haplotype structures and enabling reference panels to be created. The availability of accurate reference panels and a large number of genotyped individuals has allowed the identification of independent variants and improved our understanding of their contribution to disease heritability and pathways underlying disease biology [9, 10].

Advances in laboratory-based mapping of MHC variation

Increased throughput, accuracy, and read length in next-generation sequencing (NGS) technologies, as well as the development of user-friendly bioinformatics tools, have enabled higher resolution MHC typing [11]. For instance, whole-genome sequencing (WGS) was successfully used to type HLA-A alleles at full resolution in 1070 healthy Japanese individuals [12] and to fully evaluate HLA-E variability in West African populations [13]. However, the main problem with MHC sequencing using current technologies is the relatively short read lengths, which limit the amount of allelic data that can be generated at a high resolution. Long-range PCR amplification approaches, such as the use of PacBio systems for single molecule real-time sequencing, significantly increase read length and the resolution for typing MHC alleles [14]. In a comparison of MHC typing in an Indian population using sequence-specific primers, NGS (Roche/454) and single molecule sequencing (PacBio RS II) platforms, higher resolution typing was achieved for MHC class I (HLA-A, HLA-B, and HLA-C) and class II genes (HLA-DRB1 and HLA-DQB1) using the PacBio platform, with a median read length of 2780 nucleotides [15].

High-density SNP panels, such as the Immunochip platform [7], which has been widely implemented in immunogenetics studies, are a cheaper, faster, and easier alternative to genotyping than direct MHC typing and NGS methods. The Immunochip contains a dense panel of SNPs from the MHC locus, which enables missing classic MHC variants to be inferred in silico, where the imputation is based on the haplotype structure present in large reference panels (Fig. 1). This fine-mapping approach has been used for several autoimmune and inflammatory diseases (Table 1) and for a few infectious diseases (Additional file 1), thereby allowing comprehensive interrogation of the MHC. Moreover, population-specific reference panels made by deep sequencing and used to impute genotypes allow identification of very rare variants and novel single-nucleotide variants in the human genome. This is illustrated by a recent study in which the authors first built a Han Chinese MHC-specific database by deep sequencing the region in 9946 patients with psoriasis and 10,689 healthy controls, and then used this reference panel to impute genotype data to fine-map psoriasis-associated variants [16]. Notably, functional variants in non-coding regions can be identified, as shown in a Japanese cohort of 1070 healthy individuals [12]. These variants would be impossible to discover using SNP microarrays or low coverage sequencing on the same sample size (Fig. 1, Table 1).

Fig. 1

Major histocompatibility complex imputation. A reference cohort of subjects for whom both genetic information and classic human leukocyte antigen (HLA) typing is available can be used to infer the missing (untyped) genotypes and amino acids in a discovery cohort. This allows imputed variants to be tested for their associations with a disease of interest. The figure shows imputation points to classic alleles associated with celiac disease risk in the MHC region on chromosome 6. Y tyrosine, S serine, Q glutamine, T threonine, R arginine, E glutamic acid

Table 1 Major histocompatibility complex (MHC) associations to autoimmune diseases, as described by fine-mapping studies

MHC associations revealed by genome-wide association studies (GWAS) can often not be fine-mapped to a single allele at a single locus; rather they comprise independent effects from multiple loci (see “Role of MHC variants in human diseases”). The presence of these multiple, independent effects highlights the heterogeneous nature within and between diseases, which may lead to varying immunological responses. Fine-mapping has also shown that autoimmune diseases share MHC alleles and hence molecular pathways, which are likely to represent targets for shared therapies. For instance, the major associations within MHC class II across autoimmune diseases imply that modulating T-cell receptor (TCR) activation by using peptide-bearing MHC molecules on antigen-presenting cells (APCs) could be therapeutically useful [17]. Shared MHC genetic factors have also been observed between autoimmune and infectious diseases, suggesting that human genetic architecture has evolved in response to natural selection as determined by various infectious pathogens [18].

Advances in computational approaches for mapping MHC variation

Long-range LD between loci and SNP markers across the MHC offers an alternative approach to interrogate functional MHC variation through imputation. The development of different imputation tools using population-specific reference panels has enhanced the interpretation of genotype data derived from genome-wide platforms. MHC imputation is done using reference panels containing both genetic information and classic HLA serotyping, thus allowing identification of MHC allelic and amino acid variants. It is advantageous to impute allele and amino acid variants in the MHC because the background sequence diversity of the locus breaks the assumption that a SNP is binary: many MHC SNPs have more than two alleles, and a single position can encode several different amino acids. For instance, six possible amino acid variants at position 11 in the HLA-DRB1 gene show the strongest association to rheumatoid arthritis (RA) [19]. Two of these (valine and leucine) confer susceptibility to RA, whereas the other four (asparagine, proline, glycine and serine) are protective.
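One way to picture how a multi-allelic position can still be analyzed is to expand it into one marker per residue, each of which can then be tested like an ordinary biallelic variant. The sketch below does this for the six residues listed above at HLA-DRB1 position 11. It assumes that each individual's two residues are already known; the function name and input format are our own illustration and do not reproduce the file format of any imputation tool.

```python
# Residues at HLA-DRB1 position 11 as listed in the text (one-letter codes):
# valine, leucine, asparagine, proline, glycine, serine.
POSITION_11_RESIDUES = ["V", "L", "N", "P", "G", "S"]


def encode_presence(residue_pair, residues=POSITION_11_RESIDUES):
    """Return a dict mapping each residue to its dosage (0, 1, or 2 copies).

    residue_pair holds the residues carried on an individual's two
    chromosomes, for example ("V", "S").
    """
    unknown = set(residue_pair) - set(residues)
    if unknown:
        raise ValueError(f"unexpected residue(s): {sorted(unknown)}")
    return {r: sum(1 for x in residue_pair if x == r) for r in residues}


# Example: one valine-carrying and one serine-carrying chromosome.
example = encode_presence(("V", "S"))
```

Each residue column can then enter an association model on its own, or the columns can be tested jointly to ask whether the position as a whole is associated with disease.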

Several tools allowing imputation of classic HLA alleles at four-digit resolution are now available for MHC imputation analysis; the most common are SNP2HLA [20], HLA*IMP:01 [21], and an improved HLA*IMP:02 [22]. HLA*IMP:02 outperforms HLA*IMP:01 on heterogeneous European populations and it increases the power and accuracy in cross-European GWAS [22]. Missing data are also better tolerated in HLA*IMP:02, while SNP genotyping platforms must be selected in HLA*IMP:01 [21, 22]. SNP2HLA not only imputes classic alleles but also amino acids by using two European reference panels, one based on data from HapMap-CEPH (90 individuals), and the other on the Type 1 Diabetes Genetics Consortium (T1DGC) study [20]. Another tool, HLA-VBSeq, allows imputation of MHC alleles at full resolution from whole-genome sequence data [23]. HLA-VBSeq does not require prior knowledge of MHC allele frequencies and can therefore be used for samples from genetically diverse populations [23]. It has successfully typed HLA-A alleles at full resolution in a Japanese population and identified rare causal variants implicated in complex human diseases [12].

One commonly used European reference panel for imputation is the T1DGC panel, which covers SNP genotyping and classic HLA serotyping information for 5225 unrelated individuals [20]. Similar population-specific reference panels have been developed for non-European studies to investigate the risk of psoriasis in Chinese populations [16] and of Graves’ disease and RA in Japanese populations. The panels have also been used to impute MHC alleles and amino acids for East Asian and Korean populations [24,25,26].
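The role of the reference panel can be illustrated with a deliberately simplified sketch: a target SNP haplotype is assigned the classic allele carried by the most similar reference haplotypes. The panel, the SNP patterns, and the function names below are hypothetical, and the nearest-match vote is our own simplification; the tools cited above rely on probabilistic haplotype models rather than this rule.

```python
from collections import Counter

# Hypothetical reference panel: (SNP haplotype, typed classic allele) pairs.
REFERENCE_PANEL = [
    ((0, 1, 1, 0, 1), "DRB1*04:01"),
    ((0, 1, 1, 0, 0), "DRB1*04:01"),
    ((1, 0, 0, 1, 1), "DRB1*15:01"),
    ((1, 0, 0, 1, 0), "DRB1*15:01"),
    ((1, 1, 0, 0, 1), "DRB1*03:01"),
]


def hamming(a, b):
    """Count the SNP positions at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def impute_allele(target, panel=REFERENCE_PANEL):
    """Return (allele, support) from the reference haplotypes closest to target.

    support is the fraction of equally close reference haplotypes that
    carry the returned allele, a crude stand-in for a posterior probability.
    """
    best = min(hamming(target, hap) for hap, _ in panel)
    votes = Counter(allele for hap, allele in panel if hamming(target, hap) == best)
    allele, count = votes.most_common(1)[0]
    return allele, count / sum(votes.values())
```

A target that matches a reference haplotype exactly receives full support, whereas a target equally close to haplotypes carrying different alleles receives split support, mirroring why accuracy drops when the panel does not represent the study population well.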

Using a single reference genome for regions like the MHC, which has substantial sequence and structural diversity, results in poor characterization. To counteract this, an algorithm was developed to infer much of the variation in the MHC; it allows genome inference from high-throughput sequencing data using known variation represented in a population reference graph (PRG) [27]. Specifically, the PRG constructed for the MHC combined eight assembled haplotypes, the sequences of known classic HLA alleles, and 87,640 SNP variants from the 1000 Genomes Project [28]. This approach is considered to be an intermediate step between de novo assembly and mapping to a single reference, but requires careful attention to the variation included in the PRG [27].

Despite the development of new tools to investigate MHC variation, the robustness of imputation depends largely on the reference panel and SNP selection. The frequency of alleles can differ between populations, thus highlighting the need to use population-specific reference panels to impute MHC alleles and amino acids. Additionally, large sample sizes make it possible to analyze the non-additive effects of MHC alleles on a wide scale, as described by Lenz et al. for celiac disease (CeD), psoriasis, and type 1 diabetes (T1D) [29]. These non-additive effects could explain why some susceptibility variants have remained unidentified. However, one important limitation of existing imputation methods is that they are limited to the classic MHC alleles and their amino acids. Another limitation is that accuracy is lower for low frequency or rare variants [20, 30]; this can be improved by increasing the reference panel size, together with the use of deep sequencing data. Ascertainment bias and lower LD also make it challenging to impute MHC variants in some non-European populations, such as African populations.

MHC genetic variation mediates susceptibility to a wide range of complex diseases, including infectious and autoimmune diseases. The large volume of data generated by recent GWAS has provided an excellent opportunity to apply imputation tools to fine-map MHC associations to classic alleles and amino acids, as described below for autoimmune diseases. Overall, MHC imputation has proved to be a robust and cost-effective way to identify causal genes underlying disease pathogenesis. Ultimately, knowing the causal genes will help explain disease heritability and lead to a better understanding of the molecular pathways involved in disease pathogenesis. Such work helps to pinpoint potential therapeutic targets.

Role of MHC variants in human diseases

Insights into MHC susceptibility for autoimmune diseases: fine-mapping results, epistasis, and disease biology

Associations between the MHC and autoimmune diseases reported in the 1970s were some of the earliest described genetic associations [31, 32], and they remain the strongest risk factors for autoimmune diseases. After the development of wide-screen genotyping platforms and imputation pipelines, MHC imputation and fine-mapping were performed in European and Asian populations for most common autoimmune diseases, including RA [19, 25, 33, 34], CeD [35], psoriasis [36], ankylosing spondylitis (AS) [37], systemic lupus erythematosus (SLE) [33, 38,39,40,41], T1D [42, 43], multiple sclerosis (MS) [44, 45], Graves’ disease [24], inflammatory bowel disease (IBD) [46], and dermatomyositis (DM) [47]. Table 1 shows the main associated variants and independently associated loci for autoimmune diseases.

In 2012, a pioneering MHC fine-mapping study, performed in individuals of European ancestry with RA [19], confirmed the strongest association with the class II HLA-DRB1 gene, as well as other independent associations. Previously an increased risk of RA was reported for a set of consensus amino acid sequences at positions 70–74 in the HLA-DRB1 gene, known as the “shared epitope” locus [48]. The imputed data revealed the most significant associations were with two amino acids at position 11, located in a peptide-binding groove of the HLA-DR heterodimer. This suggested a functional role for this amino acid in binding the RA-triggering antigen. Similar fine-mapping studies followed for other autoimmune diseases (Table 1).

In general, in most autoimmune diseases, fine-mapping strategies have confirmed the main associated locus reported by serotype analysis within a certain MHC locus. Such strategies have also allowed identification of specific allelic variants or amino acids, as well as independent variants in different HLA classes. For instance, in CeD, the strongest association was with the known DQ-DR locus, and five other independent signals in classes I and II were also identified. CeD is the only autoimmune disease for which the antigen, gluten, is known and well studied. Gluten is a dietary product in wheat, barley, and rye. It is digested in the intestine and deamidated by tissue transglutaminase enzymes such that it perfectly fits the binding pockets of a particular CeD-risk DQ heterodimer (encoded by the DQ2.2, DQ2.5, and DQ8 haplotypes). This association was confirmed by MHC fine-mapping, which indicated roles for four amino acids in the DQ genes with the strongest independent associations to CeD risk [35]. Similarly, the main associations were determined for T1D, MS, and SLE within the MHC class II locus (the associations for these three diseases are to a particular HLA-DQ-DR haplotype), and there are also independent, but weaker associations with the class I and/or III regions. In DM, fine mapping in an Asian population identified MHC associations driven by variants located around the MHC class II region, with HLA-DP1*17 being the most significant [47]. In contrast, the primary and strongest associations in psoriasis and AS were to MHC class I molecules, while independent associations to the class I locus were also reported for IBD and Graves’ disease. Class III variants are weakly implicated in autoimmune diseases, but several associations in the MHC class III region were seen for MS; for instance, the association to rs2516489 belonging to the long haplotype between MICB and LST1 genes. 
The association signal to rs419788-T in the class III region gene SKIV2L has also been implicated in SLE susceptibility, representing a novel locus identified by fine-mapping in UK parent–child trios [39]. A large meta-analysis of European SLE cases and controls also identified an independent class III association signal (rs8192591), located upstream of NOTCH4 [40]. However, further studies are needed to explain how these genetic variations contribute to predisposition to SLE.

In addition to identifying independent variants, MHC fine-mapping studies permit analysis of epistatic and non-additive effects in the locus. These phenomena occur when the effect of one allele on disease manifestation depends on the genotype of another allele in the locus (non-additive effect), or on the genotype of the “modifier” gene in another locus (epistasis). Non-additive MHC effects were established in CeD, in which knowing gluten was the causal antigen offered an advantage in investigating the antigen-specific structure of the DQ-heterodimer. CeD risk is mediated by the presence of several HLA-DQ haplotypes, including the DQ2.5, DQ2.2, and DQ8 haplotypes, which form the specific pocket that efficiently presents gluten to T cells. These haplotypes can be encoded either in cis, when both DQA1 and DQB1 are located on the same chromosome, or in trans, when they are located on different chromosomes. Some DQ allelic variants confer susceptibility to CeD only in combination with certain other haplotypes, forming a CeD-predisposing trans-combination. For example, HLA-DQA1*0505-DQB1*0301 (DQ7) confers risk to CeD only if it is combined with DQ2.2 or DQ2.5, contributing to the formation of susceptible haplotypes in trans. In particular, DQ7/DQ2.2 heterozygosity confers a higher risk for CeD than homozygosity for either of these alleles, and is an example of a non-additive effect for both alleles.
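A non-additive effect of this kind can be expressed as the gap between the heterozygote's log odds ratio and the value expected if the two haplotypes acted additively, which is the average of the two homozygote log odds ratios. The sketch below computes that gap. The counts are invented for illustration, "A" and "B" are placeholder haplotypes rather than estimates for DQ7 or DQ2.2, and the function names are our own.

```python
import math

# Hypothetical (cases, controls) counts per genotype; "A" and "B" are
# placeholder haplotypes, not estimates for any real HLA-DQ haplotype.
TOY_COUNTS = {
    "reference": (100, 400),
    "A/A": (20, 40),
    "B/B": (30, 60),
    "A/B": (80, 40),
}


def log_odds_ratio(counts, genotype, reference="reference"):
    """Log odds ratio of disease for a genotype versus the reference group."""
    cases, controls = counts[genotype]
    ref_cases, ref_controls = counts[reference]
    return math.log((cases / controls) / (ref_cases / ref_controls))


def non_additive_deviation(counts):
    """Heterozygote log odds ratio minus the additive expectation.

    Under a purely additive model on the log-odds scale, the A/B value
    equals the mean of the A/A and B/B values, so the deviation is zero.
    """
    expected = 0.5 * (log_odds_ratio(counts, "A/A") + log_odds_ratio(counts, "B/B"))
    return log_odds_ratio(counts, "A/B") - expected


deviation = non_additive_deviation(TOY_COUNTS)
```

A positive deviation means the heterozygote carries more risk than the additive model predicts. A real analysis would fit a regression model with interaction terms and covariates rather than comparing raw counts.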

Unlike CeD, the exact haplotypes and their associated properties remain unknown for most other autoimmune diseases; therefore, analyzing non-additive effects might yield new insights into potentially disease-causing antigens. Lenz et al. provided evidence of significant non-additive effects for autoimmune diseases, including CeD, RA, T1D, and psoriasis, which were explained by interactions between certain classic HLA alleles [29]. For instance, specific interactions that increase T1D disease risk were described between HLA-DRB1*03:01-DQB1*02:01/DRB1*04:01-DQB1*03:02 genotypes [49] and for several combinations of the common HLA-DRB1, HLA-DQA1, and HLA-DQB1 haplotypes [43]. In AS, epistatic interaction was observed for combinations of HLA-B60 and HLA-B27, indicating that individuals with the HLA-B27+/HLA-B60+ genotype have a high risk of developing AS [50]. Moreover, a recent study in MS found evidence for two interactions involving class II alleles: HLA-DQA1*01:01-HLA-DRB1*15:01 and HLA-DQB1*03:01-HLA-DQB1*03:02, although their contribution to the missing heritability in MS was minor [44].

Epistatic interactions between MHC and non-MHC alleles have also been reported in several autoimmune diseases, including SLE, MS, AS, and psoriasis. For instance, in a large European cohort of SLE patients, the most significant epistatic interaction was identified between the MHC region and cytotoxic T lymphocyte antigen 4 (CTLA4) [9], which is upregulated in T cells upon encountering APCs. This highlights that appropriate antigen presentation and T-cell activation are important in SLE pathogenesis [9]. Notably, interactions between MHC class I and specific killer immunoglobulin receptor (KIR) genes are important in predisposition to autoimmune diseases such as psoriatic arthritis, scleroderma, sarcoidosis, and T1D [51,52,53,54]. KIR genes are encoded by the leukocyte receptor complex on chromosome 19q13 and expressed on natural killer cells and subpopulations of T cells [55]. Finally, epistatic interactions between MHC class I and ERAP1 have been described for AS, psoriasis, and Behçet’s disease [10].

Association of novel MHC variants and identification of interaction effects within the MHC are increasing our understanding of the biology underlying autoimmune and inflammatory diseases. Fine-mapping the main associated locus within HLA-DQ-DR haplotypes has allowed determination of the key amino acid positions in the DQ or DR heterodimer. Pinpointing specific amino acids leads to a better understanding of the structure and nature of potential antigens for autoimmune or inflammatory diseases, and these can then be tested through binding assays and molecular modeling. The fact that these positions are located in peptide-binding grooves suggests they have a functional impact on antigenic peptide presentation to T cells, either during early thymic development or during peripheral immune responses [19]. In addition, analysis of non-additive effects in MHC-associated loci offers the possibility to identify antigen-specific binding pockets and key amino acid sequences. For example, identification of the protective, five-amino acid sequence DERAA as a key sequence in the RA-protective HLA-DRB1*13 allele, and its similarity to human and microbial peptides, led to identification of (citrullinated) vinculin and some pathogen sequences as novel RA antigens [56].

The identification of independent signals in MHC classes I and III for many autoimmune diseases implies that these diseases involve novel pathway mechanisms. For example, association of CeD to class I molecules suggests a role for innate-like intraepithelial leucocytes that are restricted to class I expression and that are important in epithelial integrity and pathogen recognition [57]. Class I associations to RA, T1D, and other autoimmune diseases suggest that CD8+ cytotoxic cells are involved in disease pathogenesis, as well as CD4+ helper T cells.

Discovering the epistatic effects of MHC and non-MHC loci can also shed light on disease mechanisms. For example, ERAP1 loss-of-function variants reduce the risk of AS in individuals who are HLA-B27-positive and HLA-B*40:01-positive, but not in carriers of other risk haplotypes [37]. Similar epistatic effects were also observed for psoriasis, such that individuals who carry variants in ERAP1 showed an increased risk only when they also carried an HLA-C risk allele [58]. In line with these observations, mouse studies have shown that ERAP1 determines the cleavage of related epitopes in such a way that they can be presented by the HLA-B27 molecule [37]. Confirming that certain epitopes must be cleaved by ERAP1 to be efficiently presented to CD4+ and CD8+ T cells will be a critical step in identifying specific triggers for autoimmune diseases.

The recent discoveries of genetic associations between MHC alleles and autoimmune diseases are remarkable and offer the potential to identify disease-causing antigens. This would be a major step towards developing new treatments and preventing disease. However, we still do not understand exactly how most associated alleles and haplotypes work, and extensive functional studies are needed to clarify their involvement in disease.

Explained heritability by independent MHC loci for autoimmune diseases

Heritability is an estimation of how much variation in a disease or phenotype can be explained by genetic variants. Estimating heritability is important for predicting diseases but, for common diseases, it is challenging and depends on methodological preferences, disease prevalence, and gene–environment interactions that differ for each phenotype [59]. It is therefore difficult to compare heritability estimates across diseases. Nevertheless, for many diseases, estimates have been made as to how much phenotypic variance can be explained by the main locus and by independent MHC loci [29].

For autoimmune diseases with a main association signal coming from a class II locus, the reported variance explained by MHC alleles varies from 2–30% [9]. The strongest effect is reported for T1D, in which the HLA-DR and HLA-DQ haplotypes explain 29.6% of phenotypic variance; independently associated loci in HLA-A, HLA-B, and HLA-DPB1 together explain about 4% of the total phenotypic variance, while all other non-MHC loci are responsible for 9% [60]. Similarly, in CeD, which has the same main associated haplotype as T1D, the HLA-DQ-DR locus explains 23–29% of disease variance (depending on the estimated prevalence of disease, which is 1–3%), whereas other MHC alleles explain 2–3%, and non-MHC loci explain 6.5–9% [35]. In seropositive RA, 9.7% of phenotypic variance is explained by all the associated DR haplotypes, whereas a model including three amino acid positions in DRB1, together with independently associated amino acids in HLA-B and HLA-DP loci, explains 12.7% of the phenotypic variance [19]. This indicates that non-DR variants explain a proportion of heritability comparable to that in other non-MHC loci (4.7–5.5% in Asians and Europeans) [19]. The non-additive effects of DQ-DR haplotypes can also explain a substantial proportion of phenotypic variance: 1.4% (RA), 4.0% (T1D), and 4.1% (CeD) [29]. In MS, the major associated allele, DRB1*15:01, accounts for 10% of the phenotypic variance, whereas all the alleles in DRB1 explain 11.6%. A model including all of the independent variants (and those located in classes I, II, and III) explains 14.2% of the total variance in MS susceptibility [45].
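As a worked example of how these figures relate to one another, the sketch below takes the three T1D values quoted above and computes the share of the explained variance attributable to the MHC. It assumes that the three components are non-overlapping and can simply be summed, which the cited studies do not necessarily guarantee; the variable and function names are our own.

```python
# Percent of phenotypic variance explained in T1D, as quoted in the text.
T1D_VARIANCE_EXPLAINED = {
    "HLA-DR/DQ haplotypes": 29.6,
    "HLA-A, HLA-B, HLA-DPB1": 4.0,  # "about 4%" in the text
    "non-MHC loci": 9.0,
}


def summarize(components):
    """Return (total percent explained, each component's share of that total).

    Assumes the components are additive and non-overlapping.
    """
    total = sum(components.values())
    return total, {name: value / total for name, value in components.items()}


total_explained, component_shares = summarize(T1D_VARIANCE_EXPLAINED)

# Combined MHC share: main haplotypes plus independently associated MHC loci.
mhc_share = (
    component_shares["HLA-DR/DQ haplotypes"]
    + component_shares["HLA-A, HLA-B, HLA-DPB1"]
)
```

Under this additive assumption, the explained variance in T1D totals roughly 43% of phenotypic variance, of which close to four-fifths comes from the MHC.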

In SLE, the proportion of variance explained by the MHC is notably lower, at only 2% [41], and is mostly due to class II variants. In IBD, the association with MHC is weaker than in classic autoimmune diseases, with a lower contribution seen in Crohn’s disease (CD) than in ulcerative colitis (UC) [61]. The main and secondary variants can now explain 3.1% of heritability in CD and 6.2% in UC, which is two to ten times greater than previously attributed by main effect analysis in either disease (0.3% in CD and 2.3% in UC for the main SNP effect) [46]. Among all the diseases discussed here, the main effect of the associated haplotype is far stronger than the independent effects from other loci (with the exception of IBD, in which the MHC association is weaker overall). However, independent MHC loci can now explain a comparable amount of the disease variance to that explained by the non-MHC associated genes known so far.

Insights into MHC susceptibility for infectious diseases: GWAS, fine-mapping results, and epistasis

In principle, an infectious disease is caused by interactions between a pathogen, the environment, and host genetics. Here, we discuss MHC genetic associations reported in infectious diseases from GWAS (Table 2) and how these findings can explain increased susceptibility or protection by affecting human immune responses. We also consider why certain MHC classes are important in particular infectious diseases. We note that fewer MHC associations have been found for infectious diseases than for autoimmune diseases, mainly because of the smaller cohort sizes for infectious diseases. Thus, extensive fine-mapping studies (and imputation) have yet to be performed, with the exception of a few studies on infections such as human immunodeficiency virus (HIV) [62], human hepatitis B virus (HBV) [63, 64], human hepatitis C virus (HCV) [65], human papilloma virus (HPV) seropositivity [66], and tuberculosis [67].

Table 2 Major histocompatibility complex (MHC) associations and risks for infectious diseases identified by genome-wide association studies (GWAS)

From a genetic viewpoint, one of the best-studied infectious diseases is HIV infection. MHC class I loci have strong effects on HIV control [62, 68,69,70,71] and acquisition [72], viral load set point [69,70,71], and non-progression of disease [73] in Europeans [69, 70, 72, 73], and in multi-ethnic populations (Europeans, African-Americans, Hispanics, and Chinese) [62, 68, 71]. A GWAS of an African-American population indicated a similar HIV-1 mechanism in Europeans and African-Americans: about 9.6% of the observed variation in viral load set point can be explained by HLA-B*5701 in Europeans [69], while about 10% can be explained by HLA-B*5703 in African-Americans [68]. In contrast, the MHC associations and imputed amino acids identified in Europeans and African-Americans were not replicated in Chinese populations, possibly because of the varied or low minor allele frequencies of these SNPs in Chinese people [71]. A strong association to the MHC class I polypeptide-related sequence B (MICB) was also revealed by a recent GWAS for dengue shock syndrome (DSS) in Vietnamese children [74]. This result was replicated in Thai patients, indicating MICB can be a strong risk factor for DSS in Southeast Asians [75].

HLA-DP and HLA-DQ loci, along with other MHC or non-MHC loci (TCF19, EHMT2, HLA-C, HLA-DOA, UBE2L3, CFB, CD40, and NOTCH4) are consistently associated with susceptibility to HBV infection in Asian populations [76,77,78,79,80,81,82,83]. Significant associations between the HLA-DPA1 locus and HBV clearance were also confirmed in independent East Asian populations [79, 81]. A fine-mapping study of existing GWAS data from Han Chinese patients with chronic HBV infection used SNP2HLA as the imputation tool and a pan-Asian reference panel. It revealed four independent associations at HLA-DPβ1 positions 84–87, HLA-C amino acid position 15, rs400488 at HCG9, and HLA-DRB1*13; together, these four associations could explain over 72.94% of the phenotypic variance caused by genetic variations [64]. Another recent study using imputed data from Japanese individuals indicated that class II alleles were more strongly associated with chronic HBV infection than class I alleles (Additional file 1) [63]. Similarly, the HLA-DQ locus influences the spontaneous clearance of HCV infection in cohorts of European and African ancestry, while DQB1*03:01, which was identified by HLA genotyping together with the non-MHC IL28B, can explain 15% of spontaneous HCV infection clearance cases [65]. HLA-DQB1*03 also confers susceptibility to chronic HCV in Japanese people [84]. A GWAS in a European population revealed that HPV8 seropositivity is influenced by the MHC class II region [85]. However, HPV type 8 showed a higher seropositivity prevalence than other HPV types at the population level [66]; this led to a limited power to detect associations with other HPV types. Fine-mapping using the same European population as in the GWAS [66] revealed significant associations with HPV8 and HPV77 seropositivity, but only with MHC class II alleles, not with class I alleles. This indicates a pivotal role for class II molecules in antibody immune responses in HPV infection. 
Notably in this study, imputation was performed using HLA*IMP:02 and reference panels from the HapMap Project [86] and the 1958 British Birth Cohort, as well as using SNP2HLA with another reference panel from the T1DGC. Both imputation tools provided comparable results, thus highlighting the important role of MHC class II alleles in antibody response to HPV infection [66].

A GWAS on leprosy in Chinese populations pointed to significant associations with HLA-DR-DQ loci [87, 88]; these results were replicated in an Indian population [89]. Fine-mapping the MHC showed that variants in HLA class II were extensively associated with susceptibility to leprosy in Chinese people, with HLA-DRB1*15 being the most significant variant [87]. HLA class II variants also influence the mycobacterial infection tuberculosis in European and African populations [67, 90]. Fine-mapping identified the DQA1*03 haplotype, which contains four missense variants and contributes to disease susceptibility [67]. A meta-analysis showed that five variants (HLA-DRB1*04, *09, *10, *15, and *16) increase the risk of tuberculosis, especially in East Asian populations, whereas HLA-DRB1*11 is protective [91].

Using a population from Brazil, the first GWAS on visceral leishmaniasis revealed that the class II HLA-DRB1-HLA-DQA1 locus had the strongest association signal; this was replicated in an independent Indian population [92]. This common association suggests that Brazilians and Indians share determining genetic factors that are independent of the different parasite species in these geographically distinct regions.

Finally, epistatic interactions between MHC class I alleles and certain KIR alleles are associated with slower progression to acquired immunodeficiency syndrome (AIDS) (KIR3DS1 combined with HLA-B alleles) [93] and with better resolution of HCV infection (KIR2DL3 combined with its ligand, human leukocyte antigen C group 1, HLA-C1) [94].

Insights into the biology of infectious diseases

Associations with the MHC class I locus suggest a critical role for CD8+ T-cell responses in major viral infections such as HIV, dengue, and HCV. In HIV infection, this role is reflected by the slow disease progression seen in infected individuals who mount increasing CD8+ T-cell responses specific to conserved HIV proteins such as Gag p24 [95]. Interestingly, five out of six amino acid residues (Additional file 1) identified as associated with HIV control [62] lie in the MHC class I peptide-binding groove, implying that MHC variation affects peptide presentation to CD8+ T cells. In particular, the amino acid at position 97, which lies in the floor of the groove in HLA-B, was most significantly associated with HIV control (P = 4 × 10⁻⁴⁵) [62]. This amino acid is also implicated in MHC protein folding and cell surface expression [96]. An association found in severe dengue disease also underscores the role of CD8+ T cells in disease pathogenesis: class I alleles that were associated with an increased risk of severe dengue disease were also associated with weaker CD8+ T-cell responses in a Sri Lankan population from an area of hyper-endemic dengue disease [97]. In HCV, similar to the protective alleles against HIV infection [95], HLA-B*27 presents the most conserved epitopes of HCV to elicit strong cytotoxic T-cell responses, thereby reducing the ability of HCV to escape from host immune responses [98].

Associations between genetic variants in the MHC class II region and disease susceptibility imply that impaired antigen presentation or unstable MHC class II molecules contribute to insufficient CD4+ T-cell responses and, subsequently, to increased susceptibility to infections. For instance, the amino acid changes in the antigen-binding groove of HLA-DPβ1 and HLA-DRβ1 that influence HBV infection may result in defective antigen presentation to CD4+ T cells or in impaired stability of MHC class II molecules, thereby increasing susceptibility to HBV infection [64]. CD4+ T-cell responses are also critical in mycobacterial infections, as has been described for leprosy and tuberculosis [99, 100]. Notably, monocyte-derived macrophages treated with live Mycobacterium leprae showed three main responses that explain infection persistence: downregulation of certain pro-inflammatory cytokines and MHC class II molecules (HLA-DR and HLA-DQ), preferentially primed regulatory T-cell responses, and reduced Th1-type and cytotoxic T-cell function [99]. Macrophages isolated from the lesions of patients with the most severe disease form, lepromatous leprosy, also showed lower expression of MHC class II molecules, providing further evidence that defective antigen presentation by these molecules leads to more persistent and more severe M. leprae infection [99].

Recently, it has been shown that CD4+ T cells are essential for the optimal production of IFNγ by CD8+ T cells in the lungs of mice infected with M. tuberculosis, indicating that communication between these two distinct effector cell populations is critical for a protective immune response against this infection [101]. Impaired antigen processing and presentation from Leishmania-infected macrophages (which are the primary resident cells for this parasite) to CD4+ T cells could explain increased susceptibility to leishmaniasis [102]. The association between HPV seropositivity and the MHC class II region also suggests that class II molecules bind and present exogenous antigens more effectively to a subset of CD4+ T cells known as Th2. These Th2 cells help primed B lymphocytes to differentiate into plasma cells and to secrete antibodies against HPV.

In support of the hypothesis that genetic effects on both CD8+ (class I) and CD4+ (class II) cells modify the predisposition to infections, it should be noted that some infectious diseases, such as HIV, HBV, HCV, and leprosy, show associations with more than one of the classic MHC classes and, in some cases, the associations differ between populations (Table 2). Moreover, consideration must be given to the differences between viral and bacterial genotypes in the same infection, which play a role in determining potentially protective effects. Overall, associations with multiple MHC loci reflect the complex and interactive nature of host immune responses when the host encounters a pathogen.

Relationship between the MHC variants involved in autoimmune and infectious diseases

Both autoimmune and infectious diseases seem to involve certain MHC classes (Fig. 2a), and only a few MHC alleles are shared between these two distinct disease groups (Fig. 2b). The identification of shared MHC variation has provided insight into the relationships between the MHC variants involved in autoimmune and infectious diseases, which have been uniquely shaped throughout human evolution [18].

Fig. 2
Major histocompatibility complex allele associations with autoimmune and infectious diseases. a Abbreviations marked with an asterisk indicate the autoimmune disease showing the strongest association with the specific locus. b Single nucleotide polymorphisms (SNPs) and alleles in the major histocompatibility complex (MHC) shared between autoimmune and infectious diseases. The blue area shows MHC alleles located in the class I region and the green area shows those in the class II region. The blue arrows indicate either a protective effect of the genetic variants against the infectious disease or a slower progression to the infectious disease. The red arrows indicate increased susceptibility to the corresponding autoimmune or infectious disease. AIDS acquired immunodeficiency syndrome, AS ankylosing spondylitis, CD Crohn’s disease, CeD celiac disease, DM dermatomyositis, HBV hepatitis B virus, HCV hepatitis C virus, HIV human immunodeficiency virus, MS multiple sclerosis, Ps psoriasis, RA rheumatoid arthritis, SLE systemic lupus erythematosus, T1D type 1 diabetes, TB tuberculosis, UC ulcerative colitis, HPV human papilloma virus

Two hypotheses have been proposed to explain the relationships between the MHC variants involved in both groups of diseases. The first, known as the “pathogen-driven selection” hypothesis, states that pressure exerted on the human genome by pathogens has led to the advantageous selection of host defense genes and, subsequently, to much higher polymorphism in the MHC. This polymorphism has contributed to the development of complex immune defense mechanisms that protect humans against a broad range of pathogens. Thus, heterozygosity at MHC loci is evolutionarily favored and has become an efficient mechanism contributing to the highly polymorphic MHC (the “MHC heterozygote advantage”) [103]. Two examples of MHC heterozygote advantage are HIV-1-infected heterozygotes at class I loci, who are slower to progress to AIDS [104, 105], and HBV-infected heterozygotes at class II loci, who seem more likely to clear the infection [106]. In addition, human populations exposed to a more diverse range of pathogens display higher class I genetic diversity than those exposed to a smaller range [107]. However, the true effect of infectious diseases on selection might be underestimated because of the heterogeneity of many pathogens and the changing prevalence of infectious diseases over evolutionary time.

Positive selection of the advantageous effect of MHC polymorphism in infections may also be accompanied by a higher risk of developing autoimmune diseases. For example, the non-MHC locus SH2B3 rs3184504*A is a risk allele for CeD but has been under positive selection because it offers the human host protection against bacterial infections [108]. To investigate whether other genetic variants in the MHC show this opposite direction of effect between autoimmune and infectious diseases (Fig. 2b), we compared SNPs and alleles in the MHC identified by GWAS and fine-mapping studies on autoimmune diseases (Table 1; Additional file 2) with those identified in infectious diseases (Table 2; Additional file 1). On the one hand, HLA-B*27:05, which has one of the strongest associations with AS in the MHC (P < 1 × 10⁻²⁰⁰⁰) [37] and is present in all ethnic groups, increases AS risk. On the other hand, it also has a protective effect against HIV infection, with a nominally significant P value of 5.2 × 10⁻⁵ [70]. The second example of an opposite allelic effect is the association of the rs2395029*G allele with both susceptibility to psoriasis (OR = 4.1; P = 2.13 × 10⁻²⁶) [109] and AIDS non-progression (P = 9.36 × 10⁻¹²) [69]. Located in the HLA complex P5 (HCP5), rs2395029 is a proxy for the HLA-B*57:01 allele [69], the strongest protective allele against AIDS progression [110]. Non-progressors carrying the rs2395029-G allele had a lower viral load than other non-progressors [73].
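As a reading aid for the odds ratios (ORs) quoted in this section, the following is a minimal sketch of how an OR and an approximate 95% confidence interval can be derived from a 2 × 2 table of allele carriers and non-carriers in cases and controls. The function name `odds_ratio` and all counts are hypothetical and are not taken from any of the cited studies; the interval uses the standard log-based (Woolf) approximation, which is only one of several possible choices.

```python
import math


def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Return (OR, lower, upper) using a Woolf (log-based) 95% confidence interval.

    Hypothetical helper for illustration; requires all four counts to be positive.
    """
    a, b, c, d = case_carriers, case_noncarriers, control_carriers, control_noncarriers
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    or_value = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = 1.959964  # two-sided 95% normal quantile
    lower = math.exp(math.log(or_value) - z * se_log_or)
    upper = math.exp(math.log(or_value) + z * se_log_or)
    return or_value, lower, upper


# Illustrative counts only (not real data): 120 of 200 cases and
# 60 of 200 controls carry the allele.
or_value, ci_low, ci_high = odds_ratio(120, 80, 60, 140)
direction = "risk" if ci_low > 1 else "protective" if ci_high < 1 else "inconclusive"
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), direction: {direction}")
```

Under this convention, an OR above 1 with an interval excluding 1 points to increased susceptibility among carriers, and an OR below 1 to a protective effect; which allele each published OR refers to has to be checked in the original study.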

Another study showed that psoriasis patients carry the same genetic variants as HIV controllers/non-progressors and that they are particularly enriched for the protective allele HLA-B*57:01 (P = 5.50 × 10⁻⁴²) [111]. Moreover, the intergenic variant rs10484554*A, which is in LD with HLA-C (r² ≥ 0.8), was significantly associated with AIDS non-progression (P = 6.27 × 10⁻⁸) [73] and with susceptibility to psoriasis (OR = 4.66; P = 4 × 10⁻²¹⁴) [58]. HLA-C*06:02 (equivalent to HLA-Cw6) was most strongly associated with susceptibility to psoriasis (OR = 3.26; P = 2.1 × 10⁻²⁰¹) [36] and is also protective against HIV infection (OR = 2.97; P = 2.1 × 10⁻¹⁹) [62]. The same allele has been associated with susceptibility to CD (OR = 1.17; P = 2 × 10⁻¹³) [46]. Interestingly, the role of MHC in HIV control also relates to the influence of MHC expression levels. For instance, rs9264942 shows one of the most significant genome-wide effects observed on HIV control [62, 69, 70]: it is located 35 kb upstream of the HLA-C locus (Table 2) and has been associated with high HLA-C expression, conferring protection against HIV infection [112]. Explaining this protective effect, HLA-C allelic expression was correlated with increasing likelihood of CD8+ T-cell cytotoxicity [112]. However, the −35 SNP is not a causal variant, but is in LD with a SNP at the 3′ end of HLA-C; the latter affects HLA-C expression by influencing binding of the microRNA Hsa-miR-148a [113]. Notably, high HLA-C expression has a deleterious effect by conferring risk for Crohn’s disease [113]. The potential mechanism by which HLA expression levels confer resistance to pathogens, and also lead to greater autoimmunity, could be through promiscuous peptide binding [114]. Lastly, HLA-DQB1*03:02 showed a dominant risk effect for MS (OR = 1.30; P = 1.8 × 10⁻²²) [45], whereas it is a resistance allele against chronic HBV infection (OR = 0.59; P = 1.42 × 10⁻⁵) [63].
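To make the LD threshold mentioned above (r² ≥ 0.8) concrete, the sketch below computes r² between two biallelic variants from a haplotype frequency and the two allele frequencies, using the textbook definition r² = D² / (p_A(1 − p_A) p_B(1 − p_B)) with D = p_AB − p_A p_B. The function name `ld_r2` and the frequencies are hypothetical illustrations, not values from the cited studies; real analyses estimate haplotype frequencies from phased or imputed genotype data.

```python
def ld_r2(p_ab, p_a, p_b):
    """Squared correlation (r^2) between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B (each strictly between 0 and 1)
    Hypothetical helper for illustration only.
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("allele frequencies must be strictly between 0 and 1")
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Illustrative frequencies only (not real data).
perfect_proxy = ld_r2(0.20, 0.20, 0.20)   # the two alleles always travel together
weaker_proxy = ld_r2(0.18, 0.20, 0.20)    # occasionally separated by recombination
independent = ld_r2(0.04, 0.20, 0.20)     # haplotype frequency equals p_a * p_b

for label, value in [("perfect", perfect_proxy), ("weaker", weaker_proxy), ("independent", independent)]:
    print(f"{label}: r2 = {value:.3f}, passes r2 >= 0.8: {value >= 0.8}")
```

A variant that passes such a threshold can serve as a proxy (tag) for an HLA allele, but, as the −35 SNP example illustrates, a strong proxy is not necessarily the causal variant.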

The second hypothesis states that pathogens can trigger autoimmunity, as suggested by epidemiological studies [115, 116]. For example, it has recently been shown that apoptosis of infected colonic epithelial cells in mice induces the proliferation of self-reactive CD4+ T cells that are specific to cellular and to pathogenic antigens [117]. Self-reactive CD4+ T cells differentiate into Th17 cells, which promote production of auto-antibodies and auto-inflammation, implying that infections can trigger autoimmunity [117]. Other mechanisms have been proposed, such as molecular mimicry, bystander activation, exposure of cryptic antigens, and superantigens [118]. Common genetic signatures between autoimmune and infectious diseases indirectly imply that pathogens can indeed trigger autoimmunity. In line with this second hypothesis, we have identified common risk factors between autoimmune and infectious diseases, such as the following alleles: HLA-DRB1*15 for MS, SLE (Table 1), and leprosy (OR = 2.11; P = 3.5 × 10⁻²⁸) [87]; rs9275572*C, located in HLA-DQ, for chronic HCV infection (OR = 0.71; P = 2.62 × 10⁻⁶) [84] and SLE (P = 1.94 × 10⁻⁶) [119]; HLA-DQB1*03:02 for MS (OR = 1.30; P = 1.8 × 10⁻²²) [45] and pulmonary tuberculosis (OR = 0.59; P = 2.48 × 10⁻⁵) [67]; HLA-C*12:02 for UC (OR = 2.25; P = 4 × 10⁻³⁷) [46], CD (OR = 1.44; P = 3 × 10⁻⁸) [46], and chronic HBV infection (OR = 1.70; P = 7.79 × 10⁻¹²) [63]; and rs378352*T, located in HLA-DOA, for chronic HBV infection (OR = 1.32; P = 1.16 × 10⁻⁷) [78] and RA (OR = 1.24; P = 4.6 × 10⁻⁶) [25] (Fig. 2a).

Associations within the MHC region for several autoimmune diseases, such as RA, CeD, AS, T1D, Graves’ disease, and DM, and for HBV infection are driven by variants and alleles around HLA-DPB1 (Table 1), implying that viruses like HBV could trigger autoimmunity. Although the evidence is not yet convincing, HBV and HCV have been associated with extra-hepatic autoimmune perturbations [120, 121]. Lastly, the DQA1*03:01 allele, which contributes to tuberculosis susceptibility (OR = 1.31; P = 3.1 × 10⁻⁸) [67], is also a well-known risk factor for CeD as part of the DQ8 (DQA1*03-DQB1*03:02) and DQ2.3 (trans-DQA1*03:01 and DQB1*02:01) haplotypes [122]. DQA1*03 also increases susceptibility to T1D, RA, and juvenile myositis [123,124,125]. Overall, the direction of association is the same for most shared MHC class II loci, suggesting that bacteria and viruses can trigger immune responses. No viruses have been proven to cause an autoimmune disease thus far, but multiple virus infections could prime the immune system and eventually trigger an autoimmune response; this hypothesis has been supported by animal studies on MS [126].

Conclusions and future perspectives

We have discussed recent advances in understanding the genetic variation in the MHC in relation to autoimmune and infectious diseases. However, confidence in the associations between MHC and infectious diseases is limited, mainly because of the relatively small patient cohort sizes available. Further limitations to identifying and replicating associations with infectious diseases include strain differences, heterogeneity in clinical phenotypes, use of inappropriate controls (such as individuals with asymptomatic infections), and population-specific differences in allele frequency and/or haplotype structure. Finally, with the exception of a few described above, most infectious disease studies have not performed imputation. In certain populations, such as those of African ancestry, lower LD makes it challenging to perform MHC imputation.

Although application of a traditional GWAS is challenging for infectious diseases, other approaches may increase the power of genetic studies. For instance, a combination of transcriptional analysis and systems biology allowed the identification of a novel role for the type I IFN signaling pathway in the human host immune response against Candida albicans [127]. The use of control subjects for whom it is known whether they clear the infection, and who come from the same hospital as patients, could be appropriate for infectious diseases so that co-morbidities and clinical risk factors are as similar as possible between groups. Overall, initiating collaborative efforts to increase patient cohort numbers, designing better studies by using more appropriate controls and more homogeneously clinically defined patient phenotypes, and applying imputation using population-specific reference genomes would open new avenues to study the genetics of infectious diseases.

In contrast to infectious diseases, the added value of fine-mapping the MHC to pinpoint genetic risk factors for autoimmune disease has been well demonstrated by numerous studies. The associations with the same amino acids that fine-mapping of the MHC has found in both European and Asian populations suggest that the same molecular mechanism is involved, despite the differences in MHC allele frequencies seen between these ethnic groups.

MHC-based imputation approaches using genotype data, along with the use of population-specific reference panels for imputing MHC alleles and amino acids, have allowed identification of the MHC variation associated with complex diseases. Although identification is challenging, genetic variation in the MHC is of critical importance for two reasons. First, it sheds light on the development of autoimmunity, given the two hypotheses discussed above (pathogen-driven evolutionary selection of protective genes or pathogens as triggers of autoimmunity), and second, it yields greater understanding of the complexity of the human immune system. This knowledge will ultimately permit the design of better prophylactic and therapeutic strategies to achieve more balanced patient–immune responses during treatment.