Introduction

Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), is a global public health problem with an estimated 1 in 3 to 1 in 4 people being infected by MTB [1, 2]. However, less than 10% of those infected by MTB develop active TB disease. In 2014, 9.6 million new cases were diagnosed, many (12%) due to coinfection with the human immunodeficiency virus (HIV), and there were 1.4 million deaths attributed to TB [3].

Initially, TB was considered to be a familial disease, as it was realized early on that close relatives of TB cases were at higher risk of TB than similarly close unrelated contacts. Twin studies estimated the concordance of TB among monozygotic twins to be between 32 and 62%, and between dizygotic twins to be between 14 and 18% [4, 5]. These observations of co-occurrence in families and concordance decreasing with degree of relatedness led to genetic studies to identify genes associated with TB. These included candidate gene studies (reviewed in [6,7,8]) and more recently genome-wide studies [9] (summarized below). As yet, no consistent genetic model underlying TB risk has emerged. This is likely to be due to multiple genes affecting TB risk, and/or each gene or set of genes promoting risk in different people. However, it is possible that several genes residing in the same or related pathways modulate TB risk. Therefore, in this review, we take a pathway approach to dissecting the host genetics of TB in order to shed light on potentially validated findings and pathways that could be exploited for functional studies. Using genomic studies to reveal putative pathways that are dysregulated in TB may be a particularly efficient means to develop more efficacious vaccines and treatments.

Epidemiology

The development of TB disease requires exposure as well as host genetic and environmental factors that promote susceptibility. Exposure to an active TB case is required for transmission of MTB, and several epidemiologic studies have identified specific characteristics of the susceptible host and infectious TB case that facilitate the acquisition of MTB infection and development of TB disease [10,11,12]. In immunocompetent hosts, MTB organisms may be eliminated early by the innate immune system or, as in the majority of individuals, contained as a low-organism load in asymptomatic latent infection; latent infection develops into active disease in about 10% of people over their lifetimes. Because only a minority of MTB-infected individuals go on to develop active TB disease, host biologic factors are likely involved in determining progression to disease vs. long-term control of infection.

HIV is a prominent factor involved in global TB epidemiology, but the Population Attributable Fraction (PAF) for HIV is only 11% of TB cases globally [13]. Even in TB/HIV high-burden countries, the majority of TB patients are HIV-negative; for example, in Guinea-Bissau, one of the WHO TB/HIV high-burden countries, HIV-1 coinfection is present in one in four patients [14]. Hence, immunocompromising factors may play a role, but only a partial role, in TB; host genetic factors may play a modifying role, although the PAF for host genetics on TB risk is unknown. The complexity of the host-pathogen interaction is considerable, involving modified antigen expression by the dormant pathogen [15]. Proper control of the dormant MTB is, therefore, likely to be influenced by the genetic makeup of the host.

TB burden is clearly the worst in sub-Saharan Africa, East Asia, and to a lesser degree in the former Soviet Republics. The global distribution of TB is also manifested through the genetic diversity of MTB itself, as phylogenetic analysis of sequence variation in MTB shows geographic clustering of MTB lineages [16,17,18]. Clearly, both human genetic and MTB genetic variation have roles in the global distribution of TB.

Previous Genetic Studies of TB

Robust Genetic Associations—Candidate Gene Studies

In recent years, numerous candidate gene studies have been conducted to assess host genetic factors in TB risk. As with studies of other diseases, the early studies focused on one gene at a time, with only a few polymorphisms within each gene. Even after the advent of high-throughput genotyping technology, many studies continued in this vein, thus contributing limited amounts to our knowledge of the genetic underpinnings of TB. Given clear differences in incidence by geography, early studies that were only conducted in a handful of global settings could not necessarily be generalized across continents. However, with time, the geographic scope of research expanded, allowing increased potential to assess replicability across diverse ethnic backgrounds and to determine if universal susceptibility factors exist. Here, we review robust candidate gene results and discuss why candidate gene associations that are significant in multiple studies may still fail to replicate in others. Genome-wide association studies (GWAS) are also reported in the section below.

To identify robust risk loci for this review, we searched for genetic associations using the HuGE Navigator (https://phgkb.cdc.gov/HuGENavigator/home.do), a web-based application that queries population-based epidemiologic studies utilizing a machine learning algorithm to systematically select relevant studies. Indexing based on GeneID and MeSH terms, along with manual review, ensured appropriateness for inclusion in the database. Here, all abstracts identified by Phenopedia, an extension of HuGE Navigator that reports gene-disease associations organized by disease, were compiled for “Pulmonary Tuberculosis” (PTB). In addition, studies co-authored by members of our groups were also included [19,20,21,22,23,24,25,26,27,28,29]. Using manual review of each article, we identified the specific variant(s) and gene(s) associated with PTB risk, as well as the direction of effect. Manual review also excluded articles reporting solely genetic associations with PTB severity or extrapulmonary TB, as well as studies whose populations included extrapulmonary cases. We included genes that showed a significant association with PTB in at least five publications, regardless of whether the reported associations were single-locus main effects, effects detected through interactions among loci (epistasis), diplotype associations, haplotype associations, or genetic risk score associations (Table 1).

Table 1 Summary of candidate gene associations by type of association

A total of 15 genes met these criteria, with main effects representing the majority of the associations. However, all of the included genes also demonstrated other types of associations in addition to main effects. Additionally, genetic variants with main effects were reported according to their gene identity (Table 2). HLA-DRB1 and VDR had the most reported studies, with 15 unique studies recorded for each of these genes (Table 1), but aside from a few reports from South Africa, there were no reported associations between HLA and TB in Sub-Saharan African populations. A SNP in NRAMP1, rs17235416, was identified as associated with PTB risk in eight studies in a variety of ethnic populations. However, many of the specific genetic variants found to associate with PTB risk were reported by only one study. Notably, when multiple studies reported an association for a given polymorphism, the direction of effect was not always consistent between studies. For example, although rs4804803 in CD209 associated with PTB risk in both an Iranian and a West African population, the minor allele associated with protection and risk, respectively (Table 2). Similar to the main effects, most of the haplotype associations were reported by only one study. In addition, HLA-DRB1 had the most PTB-associated haplotypes reported (Table 3). However, relatively few associations for the genes in Table 2 were reported in African populations, where the burden of TB is greatest. This suggests one of several possibilities: the genetic effects could differ in these populations; environmental factors in Africa, including sanitation and malnutrition, could be more pronounced and thereby overwhelm genetic effects; or, as is the case, these populations have been severely understudied.

Table 2 Studies reporting significant main effects for candidate gene studies
Table 3 Studies reporting significant haplotype associations by gene

As noted above [109], there has been very limited replication of genetic associations across studies. Possible reasons for this include:

1) Population genetic differences: It is well established that linkage disequilibrium (LD) patterns differ greatly across the globe. This is especially true in Africa, where LD is generally the lowest [110,111,112] and TB burden is the greatest. Therefore, if a SNP under examination is not the causal SNP, LD differences among populations will decrease the probability of detecting an association even with a common functional variant. The problem is worse in studies that examine only a few polymorphisms within a gene and do not sufficiently cover its genetic variation, because the genotyped SNPs may not be highly correlated with the same functional variant in diverse populations. It is compounded further when there is allelic heterogeneity across populations.

2) Differences in study design: Both case and control definitions vary widely by research setting. Some studies define TB cases strictly by microbiological confirmation, while others use a variety of clinical criteria (reviewed in [109]). Controls are often defined even more inconsistently. Some controls are known to be exposed to an active TB case, and thus have had the opportunity to develop TB; others are population-based and have not been clinically characterized at all. These potential misclassifications can result in a bias towards the null hypothesis. Some studies have been family-based, while others used a traditional case-control design. The advantage of family-based studies of TB is that exposure is more certain even if variable [113, 114].

3) Gene-gene and gene-environment interactions: A number of studies have suggested that TB susceptibility genes interact with each other (Table 1) [7, 20, 22, 25, 115, 116]. When two genes interact, significant main effects may not be observed because of differing allele frequencies at the second gene or differing environmental exposure [117]. This is a generic danger for studies that examine one gene at a time. Further complicating the identification of human genes, a few studies have indicated an interaction between human genes and MTB lineage [118, 119]. If significant genetic effects only exist in the context of a specific MTB lineage, non-replication across populations may reflect strong gene-environment interactions.
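The first point above can be made concrete with a standard approximation: when a genotyped marker is correlated with the causal variant at an LD of r², the association test statistic at the marker is attenuated by roughly a factor of r², so the sample size needed to retain the same power grows by roughly 1/r². The sketch below is illustrative only. The function name and the sample sizes are hypothetical and are not taken from any study reviewed here.

```python
import math


def required_sample_size(n_at_causal: int, r2: float) -> int:
    """Approximate sample size needed when testing a marker SNP in LD
    (squared correlation r2) with the causal variant, given the sample
    size that would suffice if the causal variant were genotyped directly.

    Assumption: the test's non-centrality parameter scales linearly with r2.
    """
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in (0, 1]")
    return math.ceil(n_at_causal / r2)


# Hypothetical study that would need 1,000 subjects at the causal variant.
# Lower LD, as is typical of many African populations, inflates the requirement.
for r2 in (1.0, 0.5, 0.25):
    print(f"r2={r2}: n={required_sample_size(1000, r2)}")
```

Under this approximation, a marker that tags the functional variant well in one population but poorly in another could yield a significant association in the first and a null result in the second, even if the underlying effect is identical.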

Role of GWAS Vs. Candidate Gene Studies

To date, nine GWAS of PTB have been published (summarized in Table 1 of Uren et al. [9]). A 2010 study discovered an association in a gene desert on chromosome 18q11.2 in a combined Ghanaian, Gambian, and Malawian cohort [120•]. Availability of 1000 Genomes Project data allowed the authors to impute SNPs into the Ghanaian cohort and identify a genome-wide significant association for a locus 46 kb downstream of WT1 [121]. This association was replicated in Gambian, Indonesian, and Russian populations [121] and also by an independently conducted GWAS in an admixed population in South Africa [122]. The South African GWAS also detected loci on chromosomes 14q24.2 and 11q21-q22 that were just below genome-wide significance [122]. A case-control GWAS in a Russian population replicated the chromosome 11 locus, but not the chromosome 18 locus; it also detected significant association with the ASAP1 gene [123]. While a GWAS conducted in a Moroccan population did not detect any genome-wide significant associations, it replicated results from chromosomes 11 and 18 at a nominal significance (p < 0.05) [124]. A GWAS in an Indonesian population did not detect any loci that were significant after multiple testing correction, although it did identify suggestively associated loci involved in immune signaling; the previous GWAS associations were not explicitly tested in this study [125]. A GWAS conducted in Icelandic, Russian, and Croatian populations identified significant association with the HLA region [126], but did not replicate the loci on chromosomes 11 and 18. This is possibly because these previously associated variants are rarer in European populations. An analysis that stratified young versus older subjects with TB in Indonesia and Japan detected a significant locus on chromosome 20q12 associated in the younger onset group [127]. Finally, our GWAS of TB in HIV-infected subjects identified a significant association at 5q33.3, and haplotype analyses suggested that this association is due to the IL12B gene [128•].

The reasons for failure to replicate GWAS results include those noted above for candidate gene studies. Overall, such limitations have impeded progress in identifying key genetic associations for TB, but if we treat association studies as hypothesis-generating exercises, they still provide an important way to gain insight about TB pathogenesis; with careful study design, genome-wide studies can provide entry points for learning about TB biology.

Pathway Enrichment from Gene Associations

With few exceptions [21, 23, 129], most previous genetic association studies have not considered genes as part of pathways. Using the 15 genes identified by candidate gene literature review and the 36 genes identified by GWAS literature review (the latter as described in Table 1 of Uren et al. [9], and overlapping with some of the candidate genes), we used Ingenuity Pathway Analysis (IPA) (Qiagen) to determine if any pathways were enriched using all genes from our list. IPA mapped all genes from our provided list onto expert-curated canonical pathways from the Ingenuity Knowledge Base. A right-tailed Fisher’s exact test, with a Benjamini-Hochberg multiple-testing corrected p value threshold of 0.05, determined whether the association between our literature-generated list and a particular canonical pathway was significant compared to random chance.
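For readers unfamiliar with this type of enrichment test, the following is a minimal sketch of a right-tailed Fisher's exact test followed by Benjamini-Hochberg adjustment, written with the Python standard library only. It is not IPA's implementation; the function names, the background size, and all counts in the usage lines are hypothetical and chosen purely for illustration.

```python
from math import comb


def right_tailed_fisher(k: int, n: int, K: int, N: int) -> float:
    """P(overlap >= k) when a gene list of size n is drawn at random from a
    background of N genes, K of which belong to the pathway
    (hypergeometric upper tail, equivalent to a right-tailed Fisher's test)."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_hochberg(pvals: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical inputs: (genes from our list in pathway, pathway size)
# for three pathways, a 45-gene list, and a 20,000-gene background.
pathways = [(6, 50), (2, 120), (1, 300)]
raw = [right_tailed_fisher(k, 45, K, 20000) for k, K in pathways]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
print(raw, adj, significant)
```

The ratio reported alongside each p value in this kind of analysis is simply the overlap divided by the pathway size (k / K in the notation above).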

IPA identified several enriched pathways, the majority of which were involved in immune response or processes (Fig. 1; Supplemental Table 1). In total, 71 pathways were found to be significantly enriched. The most significant pathway (p < 10⁻¹⁰) was altered T cell and B cell signaling in rheumatoid arthritis, of which our literature-generated gene list represented 12.2% of the genes contained in the pathway (Fig. 1 and Supplemental Table 1). While this top pathway is labeled as being associated with rheumatoid arthritis (RA), it is well known that the same inflammatory pathways involved in RA are also involved in infectious disease response. The next few top pathways also reflected major components of the immune response to TB, such as T cells and other cells involved in pathogen recognition and the innate and adaptive immune response, and Th1 and Th2 cytokine response. Of course, as with all analyses of this type, pathway enrichment may be driven by common genes across pathways, and this is the case here as well.

Fig. 1 Ten most significantly associated canonical pathways from IPA. IPA determined the association between canonical pathways and the provided literature-based gene list. Here, the top 10 associated pathways are shown. For each canonical pathway, the significance of association is depicted as a blue bar showing the -log(p value) for the Benjamini-Hochberg corrected p value. The ratio of genes represented within the provided list vs. all genes contained within a pathway is shown as a gold point for each pathway.

While this recognition of genetic influences on the innate and adaptive immune response is not novel in itself, this analysis provides three novel insights into the genetics of response to TB. First, there may be other genes within these pathways, perhaps with smaller effect sizes, that are also important in TB genetics. Such smaller non-significant effect sizes would reflect the newly proposed “omnigenic” theory of complex trait genetics [130]. Second, we may find that the disruption of entire pathways is what contributes to TB susceptibility, not single genes. This line of thinking matches the approach of transcriptomic studies (summarized below). Third, it is noteworthy that a significantly associated pathway is diabetes signaling, and there is a well-established comorbidity between diabetes and TB. Therefore, approaches incorporating broader biological perspectives such as pathways may help to enhance our understanding of TB risk through how it may relate to other diseases.

Role of Transcriptomic Studies

Another approach to examine the role of host genomics in TB risk has been through gene expression studies, most of which have focused on identifying biomarkers that uniquely characterize TB cases [131, 132, 133•, 134,135,136,137,138]. As with genetic studies, these studies have differed widely in their choice of comparison group (household contacts, latent MTB infection (LTBI), and/or uncharacterized healthy subjects). They have also differed widely in the number of gene transcripts interrogated. Some studies reported a transcriptional signature involving the type I/II interferon pathways [131, 133•, 137, 138], while the others either did not compare their signatures to those previously published or did not describe the genes that composed these signatures. Some studies demonstrated that the signature observed in newly diagnosed TB cases normalized during or after treatment [131, 133•, 138], confirming that those genes were differentially expressed in newly diagnosed TB but not establishing a causal relationship. One prospective study identified a signature that occurred prior to the development of TB and resolved post-treatment, and went on to show that SNPs within CCL1, one of the differentially expressed genes, were associated with TB in a case-control study [133•]. The aforementioned studies used peripheral blood, which is attractive for a biomarker. Thuong et al. [139] took a different approach, by obtaining monocytes from circulating blood, stimulating them with MTB in vitro, and comparing the transcriptional responses to MTB between samples from cases and controls. This may more accurately reflect the in vivo response, but is not as translatable as a whole blood biomarker.

While the specific results of all these studies are quite different, two broad conclusions can be drawn. First, the usual focus on one gene at a time may be inadequate to accurately assess risk. The challenge of transcriptional studies is that this approach is not easily translatable to the field, even if it yields excellent biomarkers. Second, examining the entire transcriptome is more likely to produce consistent findings in pathways of interest than assessing individual genes or transcripts. A more comprehensive review of transcriptional effects can be found in a review by Orlova and Schurr in this same issue.

Lessons Learned from Mendelian Immunodeficiencies and Family Studies

Mendelian genetics has provided a proof of concept for susceptibility to mycobacterial infection and disease by demonstrating a role for gene variants. A monogenic etiology due to rare mutations with strong phenotypic effects has been observed in some children with severe or disseminated mycobacterial infections and can inform studies of the complexity of genetic risk for TB. To date, the primary immunodeficiencies (PIDs) associated with increased risk for mycobacterial infection and disease (including TB) in children have been primarily associated with defects in the IFN-γ signaling pathway [140]. This Mendelian Susceptibility to Mycobacterial Diseases (MSMD; OMIM 209950) is a rare (10⁻⁵–10⁻⁶) and highly heterogeneous condition identified in families with parental consanguinity [141]. Molecular studies of MSMD revealed 18 genetic forms associated with the IFNGR1, IFNGR2, STAT1, IRF8, CYBB, IL12B, IL12RB1, NEMO, and ISG15 genes, all of which are part of the IL-12/IFN-γ signaling pathways [140]. Mutations in these nine genes vary and exhibit incomplete penetrance that can translate into partial or complete loss of function. A general consistency among these phenotypes is impairment of IFN-γ function that affects activity of macrophages and dendritic cells for anti-mycobacterial defenses and antigen processing. Of relevance to more complex forms of resistance to TB as well as to therapeutic strategies, persons with IL12B (IL-12 p40) or IL12RB1 (its receptor) mutations do not produce enough IFN-γ and benefit from human recombinant cytokine treatment. Not surprisingly, IFN-γ treatment has no effect in persons with mutations in IFNGR1 and IFNGR2 [142]. Clearly, molecular dissection of MSMD has been crucial in defining the central role of the IL-12/IFN-γ axis and associated genetic network in controlling/determining mycobacterial infection and disease, and has also provided therapeutic opportunities.

Resistance Vs. Susceptibility as a Measured Phenotype

An alternative approach to studying the genomics of TB susceptibility that we have successfully taken is to study the genetics of extreme resistance instead of the development of disease. Specifically, HIV-infected individuals living in TB-endemic settings are at high risk for developing TB. The importance of this is borne out by the fact that TB is the number one killer of HIV-infected individuals. Therefore, individuals who are HIV-infected but remain free of MTB infection or TB disease display an extreme form of protection. Additionally, genes that confer this resistance may help identify novel therapeutics more easily than genetic factors that increase susceptibility; adding something is easier than subtracting it in vivo. We have used this strategy to differentiate between TB cases and controls despite immunosuppression to shed light on innate immune factors that influence resistance [128•]. We identified a locus near IL12B, a previously described candidate gene that has been shown to affect response to TB in mice, in highly susceptible families that have an IL12B knockout, and among persons with MSMD. We have also studied persistent tuberculin skin test negativity despite close and prolonged exposure to active TB cases [143•, 144]; while the focus here is on infection and not disease, this resistance phenotype may help identify novel targets for vaccine development and host-directed therapies [145•].

Necessary Future Directions

Need for Biological Validation

The majority of TB genetic association studies fail to identify the functional consequences of the associated polymorphisms. A few exceptions have yielded insight into TB biology. For example, a candidate gene study of CD1a illustrated how SNPs in this gene associate with markers of T cell response [146]. Similarly, a case-control study of TOLLIP showed association with TB, levels of mRNA expression of that gene, and IL6 production [147]. Gene expression studies indicated that the sodium butyrate pathway was associated with persistent TST negativity. Follow-up studies demonstrated that sodium butyrate and histone deacetylase inhibitors were associated with immunological response to MTB in vitro [145•]. While these studies had smaller sample sizes, they often included independent case-control replication sets with functional validation relevant to TB biology. Another approach to functional validation is eQTL studies, in which the association between genetic variants and RNA expression levels is demonstrated. This approach has provided considerable insight into TB biology [148, 149].

Need for Novel Pathway Approaches

As mentioned above, examination of pathways instead of single gene effects has revealed novel therapeutic insight for TB [145•]. Network approaches may be better for discovery and characterization of gene-gene interactions and pathways associated with disease than standard analytical approaches, because variants may have unremarkable individual effects but instead affect the phenotype through gene-gene interactions [150, 151]. In fact, the integration of data generated across multiple platforms to disentangle multifactorial diseases was suggested years ago [152], but rarely executed [153, 154].

Need for Thorough Epidemiology/Clinical Characterization

As we have reviewed previously [109], variability in diagnostic criteria and documentation of MTB exposure in controls can contribute to the inconsistency in the TB genetics literature. This also likely contributes to the variability across GWAS studies summarized above—differences in diagnostic criteria, local prevalence of TB, and lack of documented exposure in controls all may explain the inability to replicate across studies. However, our GWAS of HIV-infected populations in Uganda and Tanzania showed great consistency across those two populations [128•], likely due to the thorough clinical characterization of the study subjects, and longitudinal follow-up of controls over several years to exclude the development of TB. Future studies should aim to use strict criteria such as the CDC/ATS criteria for diagnosis of TB [155] and include efforts to quantify exposure in controls. In our study of resistance to MTB infection, we carefully quantified exposure to an index TB case using extensive epidemiologic and clinical data [114]. An epidemiologic risk score can be used to determine whether control subjects were highly exposed to an infectious TB case, but resisted MTB infection or disease. There are other ways to demonstrate a high level of MTB exposure; for example, South African miners who work in poorly ventilated environments and are heavily exposed to MTB, can resist MTB infection and disease for years, if not decades [156].

The Importance of HIV

Although globally one in eight new TB cases are in people with HIV and one in four HIV-infected individuals die due to TB [2, 154, 157], examining the genetic risk for TB in HIV-infected individuals is uncommon. In fact, all of the aforementioned GWAS for TB excluded HIV-infected individuals, except ours [128•]. Thus, the majority of genetic studies of TB disregarded the potential modification of the relationship between host genetic variation and risk of TB by HIV infection. We have previously shown a significant interaction between TNFR1 alleles and HIV status [24], reinforcing the importance of assessing genetic risk of TB in the context of HIV. As previously shown, novel insights into TB genetic resistance and susceptibility can be gained by focusing on TB-HIV co-infected individuals; additional studies using this approach are warranted.

Need to Study Host-Pathogen Interaction in Diverse Populations

One additional set of genetic factors that may affect risk of developing TB is external to the human genome. Host-pathogen interaction or coevolution, defined as “reciprocal, adaptive genetic changes in interacting host and pathogen species” [158], may affect MTB pathogenesis. Coevolution of host and pathogen has been hypothesized to account for some disease pathogenesis variation and the discrepancy between exposure and disease in several infectious diseases, including TB [159]. Specifically, some MTB strains are highly infectious, but only in certain hosts, where the pathogenicity is modulated by host genetic variation for which ethnicity has been used as a surrogate measure of host genetics [160, 161, 162, 163•, 164]. This is supported by the observation that there may be an association of MTB strains with host ethnicity [162]. The historical co-occurrence of humans and MTB and their co-migration out of Africa supports a long-standing relationship that provides the ideal condition for coevolution, leading to reduced pathogenicity [16]. Host-pathogen coevolution in TB is additionally supported by animal models [165, 166•]. The existence of co-evolved genes can significantly affect our ability to identify loci in both species that interact to affect disease risk or severity. A gene in one study may associate with TB because of the MTB strain, while it may not in places where the MTB strain differs. In a similar vein, we have recently shown that the disruption of coevolution between Helicobacter pylori ancestry and human ancestry increases gastric disease severity in a diverse cohort of Colombians, providing proof of principle that coevolution can be detected genome-wide [167].

Conclusions

While numerous studies have been done on the human genetics of susceptibility to MTB, the inconsistencies across studies warrant new approaches to studying the genetics of TB. Family-based studies provide an opportunity for documented exposure and powerful genetic epidemiological approaches for rare variant mapping. Rigorous clinical characterization is essential. Where single gene approaches have fallen short, new approaches using pathway and/or polygenic/omnigenic approaches have the potential to reveal new insight into the complex host-pathogen interaction between MTB and humans. Finally, functional characterization and validation of findings in diverse populations are essential for genetic findings to turn into useful tools for risk stratification and development of novel vaccine and therapeutic targets.