Introduction

Bacterial and viral infections are common causes of mortality worldwide1,2. As effective antimicrobial treatment is increasingly threatened by the spread of resistant pathogens, new strategies and alternative therapies must be explored to reduce the incidence and burden of these infections3. The epidemiological situation, exposure and virulence of the invading pathogen are important to determine the risk of acquiring transmittable diseases, such as respiratory tract infections (RTIs)4. Many patient-specific factors, such as older age, malignancies, chronic diseases, and immunosuppression, are also known to increase the incidence and severity of viral and bacterial infections5,6. Understanding the risk factors for infections is important in clinical practice to guide strategies for prevention and treatment.

Genome-wide association studies (GWAS) in large population cohorts, in which associations between phenotypes and genetic variants across the whole genome are examined, are a powerful tool to discover genetic determinants of disease and to uncover novel biology7. To date, such data are scarce for infections, and consequently, the genetic variants predisposing to these diseases are largely unknown. A previous GWAS based on self-reported history of common infections in > 200,000 individuals identified multiple loci associated with disease, most notably in genes related to the immune response8. These results, along with studies of heritability of infectious diseases9, suggest that genetic determinants play an important role in patient susceptibility to bacterial and viral infections. Exploring such undiscovered host-specific factors could be important for detection of patients at increased risk of infection and provide an avenue to identify targets for new drugs.

In the present study, we assessed the impact of genetic variation on the incidence of bacterial and viral infectious diseases in a cohort of ~ 350,000 individuals using genome-wide genotype data and diagnosis codes from hospital inpatient and death registries.

Methods

Participants and phenotypes

Data on genetic variation and incidence of infections were obtained for participants of white British ancestry from the UK Biobank longitudinal community-based cohort study. The UK Biobank protocol has been described previously10 and is available online at https://www.ukbiobank.ac.uk. In short, approximately 500,000 individuals aged 40–69 years were included in the study at multiple sites in the United Kingdom during 2006–2010. The participants are monitored with regard to lifestyle, health conditions and biomarkers.

This study included 337,484 UK Biobank participants (mean age of 57 years), of whom 181,236 (53.7%) were female and 156,248 (46.3%) were male. Infections were defined using hospital inpatient data and causes of death based on the International Classification of Diseases (ICD)-10 codes (Version: 2015, available at https://www.who.int). Clinically relevant entities, such as abdominal infections, RTIs and urinary tract infections (UTIs), were defined by review of all available ICD-10 codes. Since most acute infection episodes do not require more than one health-care visit, a single ICD-10 diagnosis code on one occasion was sufficient for inclusion. Diagnoses indicating suspected or proven bacterial or viral infections were assigned to phenotype groups based on anatomical sites and pathogens (Table S1). In some cases, the same ICD-10 diagnosis code was included in more than one phenotype. For example, B26 mumps was included in viral infections and B26.2 mumps encephalitis was included in central nervous system infections. Both primary and secondary diagnosis codes were considered in the analysis of phenotype-genotype associations, and all individuals with an infectious disease event (before or after baseline) were considered cases for that phenotype.
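The case definition described above can be illustrated with a minimal sketch. The phenotype map below is a hypothetical excerpt: only the two mumps entries are taken from the text (the full mapping is in Table S1), and matching codes by prefix, so that B26 also captures its subcodes, is our assumption about how the codes were applied.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical excerpt of a phenotype map. Only the two mumps entries
# come from the text; the names of the phenotype keys are illustrative.
PHENOTYPE_CODES: Dict[str, Tuple[str, ...]] = {
    "viral_infections": ("B26",),
    "cns_infections": ("B26.2",),
}


def normalise(code: str) -> str:
    """Upper-case a code and drop the dot so 'B26.2' and 'B262' compare equal."""
    return code.strip().upper().replace(".", "")


def phenotypes_for(codes: Iterable[str]) -> Set[str]:
    """Return every phenotype for which at least one diagnosis code matches.

    A single matching code (primary or secondary, at any time) is enough,
    and one code may place a participant in several phenotypes.
    """
    seen = {normalise(c) for c in codes}
    hits: Set[str] = set()
    for phenotype, prefixes in PHENOTYPE_CODES.items():
        wanted = tuple(normalise(p) for p in prefixes)
        # Assumption: prefix matching, so a parent code covers its subcodes.
        if any(code.startswith(wanted) for code in seen):
            hits.add(phenotype)
    return hits
```

Under these assumptions, a participant with B26.2 is a case for both phenotypes, whereas a participant with another B26 subcode is a case for viral infections only.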

Genotypes

The quality control (QC), phasing, and imputation were performed centrally as previously described11. In brief, the release included 488,377 samples genotyped on the Applied Biosystems UK BiLEVE Axiom Array (UKBL) or the Applied Biosystems UK Biobank Axiom Array (UKBB). The pre-imputation QC included removing markers with large genotype frequency differences due to batch, plate, array, or sex based on Fisher’s exact tests; departures from Hardy–Weinberg equilibrium tested with Fisher’s exact test; discordance across control replicates; and an overall missing rate > 5% or an overall minor allele frequency (MAF) < 0.0001. Further, samples that were identified as heterozygosity or missingness outliers were excluded. After applying the QC filters, a total of 487,442 samples and 670,739 autosomal markers were phased with SHAPEIT3 and imputed with IMPUTE4 using a combination of the Haplotype Reference Consortium (HRC) 1.1, UK10K, and the 1000 Genomes Phase 3 reference panels. Imputed genotypes from the UK Biobank March 2018 release were used. Unrelated participants of self-reported white British ancestry and European ethnicity based on principal component analysis who passed the genotype QC and had not withdrawn consent at the time of analyses were included (n = 337,484). We included up to 16.5 M genetic markers with a minor allele count ≥ 20 in cases and controls, and a MaCH r2 ≥ 0.8. Eighteen phenotypes with ≥ 200 cases in the cohort remained and were tested in the GWAS analyses.
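The marker-level inclusion criteria (minor allele count ≥ 20 in cases and controls, MaCH r2 ≥ 0.8) can be sketched as follows. This is an illustration only: the function names are hypothetical, and we assume that the minor allele count is computed from imputed dosages separately within cases and within controls.

```python
from typing import Sequence


def minor_allele_count(dosages: Sequence[float]) -> float:
    """Minor allele count from per-individual dosages in the range [0, 2]."""
    total = sum(dosages)
    # Each individual carries two alleles, so the other allele's count
    # is 2 * n minus the dosage sum; the minor allele is the rarer one.
    return min(total, 2 * len(dosages) - total)


def passes_filters(
    case_dosages: Sequence[float],
    control_dosages: Sequence[float],
    imputation_r2: float,
    min_mac: float = 20.0,
    min_r2: float = 0.8,
) -> bool:
    """Apply the MAC and imputation-quality thresholds stated in the text."""
    return (
        imputation_r2 >= min_r2
        and minor_allele_count(case_dosages) >= min_mac
        and minor_allele_count(control_dosages) >= min_mac
    )
```

Because the count threshold applies to cases as well as controls, the number of testable markers depends on the number of cases, which is consistent with the statement that up to 16.5 M markers were included per phenotype.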

Association analyses

In total, 81,179 (24%) of the participants had at least one diagnosis indicating a bacterial or viral infection. The number of cases (before or after UK Biobank baseline) for each phenotype included in the GWAS is listed in Table 1. Assuming an additive model, we tested the association between the genotype dosages of each marker and the infection phenotype by logistic regression (Firth’s penalized logistic regression in case of non-convergence) in PLINK212. To select covariates, a linear regression with age, sex, PC1-40, and genotype batch (three levels: UKBL interim release, UKBB interim release, and UKBB second release) as predictors and all infections as the outcome was fitted in all individuals included in the GWAS. In this model, all PCs up to PC23 reached P < 0.001 and were included as covariates in the GWAS. Associations with P values < 5e−8 were considered significant.
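For reference, the additive model tested for each marker can be written as below. The notation is ours and not taken from the original analysis: y_i is the case status of individual i, g_i the imputed genotype dosage, and x_i the covariate vector (the selected principal components and, we assume, the remaining covariates of the selection model).

```latex
\operatorname{logit}\,\Pr(y_i = 1 \mid g_i, \mathbf{x}_i)
  = \beta_0 + \beta_G\, g_i + \boldsymbol{\gamma}^{\top}\mathbf{x}_i,
  \qquad g_i \in [0, 2]
```

Under this model, the per-allele odds ratio is exp(β_G), and the reported P value corresponds to the test of β_G = 0.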

Table 1 ICD-10 codes and the number of cases and controls per phenotype included in the GWAS.

We identified regions containing one or more genome-wide significant SNPs by screening a window of 500 kb adjacent to the first genome-wide significant SNP on each chromosome sorted by genomic position. If no additional SNPs were identified, the region was limited to that specific SNP, and screening was continued at the next GWAS-significant SNP. If additional GWAS-significant SNPs were found, the window was expanded with 300 kb from the last SNP, and screened for additional GWAS-significant SNPs, until there were no more such SNPs within the next 300 kb. Within each region, the SNP with lowest P value was assigned as the index SNP. For each region, conditional association analysis was performed adjusting for all index SNPs found on the chromosome. This distance-based pruning followed by conditional analysis was repeated until no SNPs reached P value < 5e−8. Significant, independent loci with MAF ≥ 1% discovered in the GWAS were compared across the infection phenotypes in this study and all ICD categories (e.g., K57, J18) with ≥ 200 cases and other relevant phenotypes (e.g., smoking, body mass index [BMI]) within the UK Biobank cohort. Linear or logistic regression adjusting for the same covariates as in the infection GWAS was applied for 39 SNPs vs. 743 phenotypes yielding a Bonferroni corrected threshold for significance of P = 1.7e−6.
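The distance-based definition of regions can be sketched as below. This reflects one reading of the procedure: the first 500 kb window is measured from the first significant SNP, and each 300 kb extension from the most recently added SNP. The conditional analysis step is not covered. The last function reproduces the Bonferroni threshold, assuming a family-wise alpha of 0.05, which is not stated in the text.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    start: int      # position of the first significant SNP in the region
    end: int        # position of the last significant SNP in the region
    index_pos: int  # position of the SNP with the lowest P value
    index_p: float  # its P value


def define_regions(
    hits: List[Tuple[int, float]],
    first_window: int = 500_000,
    extension: int = 300_000,
) -> List[Region]:
    """Group genome-wide significant SNPs on one chromosome into regions.

    `hits` holds (position, P value) pairs for significant SNPs only.
    """
    hits = sorted(hits)
    regions: List[Region] = []
    i = 0
    while i < len(hits):
        start = end = hits[i][0]
        j = i + 1
        # Screen the initial window adjacent to the first significant SNP.
        while j < len(hits) and hits[j][0] <= start + first_window:
            end = hits[j][0]
            j += 1
        # Extend only if the initial window contained additional SNPs.
        if j > i + 1:
            while j < len(hits) and hits[j][0] - end <= extension:
                end = hits[j][0]
                j += 1
        index_pos, index_p = min(hits[i:j], key=lambda h: h[1])
        regions.append(Region(start, end, index_pos, index_p))
        i = j
    return regions


def bonferroni_threshold(n_snps: int, n_phenotypes: int, alpha: float = 0.05) -> float:
    """Per-test threshold for n_snps x n_phenotypes tests (alpha is assumed)."""
    return alpha / (n_snps * n_phenotypes)
```

With 39 SNPs and 743 phenotypes there are 28,977 tests, so `bonferroni_threshold(39, 743)` is approximately 1.7e−6, in agreement with the threshold stated above.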

GWAS catalog and LocusZoom plots

Index SNPs across all traits were linkage disequilibrium (LD) pruned, based on r2 < 0.1 in 500 kb windows with LD data from European samples from 1000 Genomes phase 3 (v5), creating a list of independent loci for all phenotypes. GWAS catalog data (downloaded 2019-06-26, available at https://www.ebi.ac.uk) within 250 kb of each independent locus were extracted, and pairwise r2 was calculated between each index SNP and catalog hit. For multi-allelic markers, r2 was calculated for the alternate allele with the highest allele frequency. LD calculations were not performed for markers that were not present in 1000G or were monomorphic in the European subset. These variants are included in the tables, but with r2 set to missing. Distance to the nearest gene was calculated as the distance from the index SNP to the transcript start or end position (whichever was closest). Regional plots of the association test results were generated for significant loci using LocusZoom v1.413 with LD data and GWAS catalog annotations. In the interpretation of results, we focused on SNPs with MAF ≥ 1% and GWAS catalog hits with r2 ≥ 0.30 and distance ≤ 100 kb from the genetic variant identified in our study. Other hits or nearby genes located within 250 kb from the index SNP are sometimes discussed if considered biologically relevant to the infectious phenotype. In these cases, the effect allele frequency (EAF), distance and r2 for that specific SNP are specified in the text.
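The pairwise r2 and the interpretation filter can be illustrated with a short sketch. We assume here that r2 is computed from phased reference haplotypes coded 0/1; the text does not state whether haplotypes or genotypes were used, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def pairwise_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> Optional[float]:
    """Squared correlation between two biallelic markers on phased haplotypes.

    Returns None (missing) when either marker is monomorphic, mirroring
    the handling of monomorphic markers described in the text.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        return None
    # Frequency of haplotypes carrying the coded allele at both markers.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def in_main_interpretation(
    r2: Optional[float],
    distance_bp: int,
    min_r2: float = 0.30,
    max_distance_bp: int = 100_000,
) -> bool:
    """Apply the r2 >= 0.30 and distance <= 100 kb criteria from the text."""
    return r2 is not None and r2 >= min_r2 and distance_bp <= max_distance_bp
```

Catalog hits with missing r2 are kept in the tables but, under this sketch, do not enter the main interpretation.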

Fine-mapping of the HLA region

Fine mapping of the human leucocyte antigen (HLA) region was performed due to the critical functions of HLA genes in the immune response, the highly polymorphic nature of the region and high LD between alleles at nearby loci. Imputation of all 11 HLA loci was done centrally using HLA*IMP:02 following the same pre-imputation QC as described for the genome-wide imputation11. Dosages for all possible alleles at each HLA locus were tested in separate logistic regression models adjusting for the same covariates as described for the GWAS. Non-tested alleles were assigned a dosage of 0. Only alleles with a minor allele count ≥ 20, calculated separately in cases and controls, were tested. After Bonferroni correction, associations with P < 1.6e−5 were considered significant.

Functional annotation

The Summary data-based Mendelian Randomization (SMR) approach14 was applied to determine whether associations between SNP and infection phenotype could be explained by known gene expression. SMR analysis was performed by jointly analysing the infection GWAS results and publicly available expression quantitative trait locus (eQTL) summary statistics, thereby assessing potential functional significance of the identified loci pointing to a causal gene. SMR 1.02 was used with the default settings including GWAS results with MAF ≥ 1%. To assess whether the GWAS and eQTL associations were due to a single shared genetic variant rather than multiple variants in LD with independent effects on the phenotype, the heterogeneity in dependent instruments (HEIDI) test was applied14. Gene expression data from eQTL studies were obtained from the Genotype Tissue Expression project (GTEx) V7 release15, and LD data from 1000G phase 3 (v5) EUR were used. To limit the number of tests, only gene expression in biologically plausible tissues was considered. Gene expression in the spleen and whole blood was considered potentially important for the immune defence and therefore relevant for all phenotypes. In addition, specific tissues were selected for phenotypes where significant SNPs had been found in the GWAS (Table S2).
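As a brief reminder of the quantity being tested, the SMR estimate and test statistic take the following form in the published method14, as we understand it. Here z denotes the top cis-eQTL used as instrument, x the gene expression level and y the infection phenotype; the subscripted z values are the corresponding z-statistics from the eQTL and GWAS summary data.

```latex
\hat{b}_{xy} = \frac{\hat{b}_{zy}}{\hat{b}_{zx}},
\qquad
T_{\mathrm{SMR}} = \frac{z_{zy}^{2}\, z_{zx}^{2}}{z_{zy}^{2} + z_{zx}^{2}}
\;\approx\; \chi^{2}_{1}
```

A significant SMR test alone does not distinguish a single shared variant from linkage between distinct variants, which is why the HEIDI test is applied in addition.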

Heritability

Narrow-sense heritability (h2) explained by additive SNP effects was calculated using LDSC v.1.0.016 and observed scale heritability estimates are reported. Infection phenotypes with at least 5000 cases (corresponding to an effective sample size of ~ 20,000) were included using a subset of the SNPs with likely high imputation quality passing the quality filters described for the GWAS, MAF ≥ 1%, and inclusion in HapMap3. For comparison, we also estimated the SNP heritability by Haseman-Elston (HE) regression in GCTA 1.92.117. Directly genotyped SNPs with MAF ≥ 1% were used to construct the genetic relationship matrix. Results from the HE regression based on the cross-product with the standard error computed using the Jackknife approach are reported.
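The correspondence between 5000 cases and an effective sample size of ~ 20,000 can be checked with the commonly used formula for case-control designs. This is a sketch under the assumption that all remaining participants serve as controls; the exact control counts per phenotype are given in Table 1.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Effective sample size of a case-control study: 4 / (1/cases + 1/controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Assumption: controls are all remaining participants of the 337,484 included.
n_total = 337_484
n_cases = 5_000
n_eff = effective_sample_size(n_cases, n_total - n_cases)
```

With these inputs the effective sample size is slightly below 20,000, in line with the approximate value quoted above.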

All methods were carried out in accordance with relevant guidelines and regulations.

Results

Genotype–phenotype associations

In total, 57 unique genome-wide significant loci were found across all phenotypes (Table 2, Fig. S1). Neither the QQ plots (Fig. S2) nor the genomic control lambda metrics (Table S3) indicated a major inflation of the association test statistics for any of the phenotypes. We detected significant variants in the HLA region on chromosome 6 (nearby genes: HLA-DQA1, HLA-DQB1, HLA-DQB1-AS1, HLA-DRB1, HLA-DRB5 and HLA-DRB6) for abdominal infections and RTIs (Figs. 1, 2A). Fine mapping of the HLA region revealed significant associations with HLA-DQA1, HLA-DRB1, and HLA-DRB4 locus alleles (Table S4). HLA-DQ and HLA-DR are major histocompatibility complex (MHC) class II molecules that play a key role in the adaptive immune response, especially against bacterial infections, by presenting pathogen antigens mainly to the CD4+ T helper cells18.

Table 2 Significant (P < 5e−08) SNPs and genetic loci associated with bacterial and viral infections in the GWAS.
Figure 1

Manhattan plots showing associations of genetic variants with (A) abdominal infections and (B) respiratory tract infections in the UK Biobank cohort (n = 337,536). The asterisk * indicates that associations were found for multiple genes belonging to the HLA-DQ group (e.g., HLA-DQA1, HLA-DQB1). Negative log10-transformed P values for each SNP (y axis) are plotted by chromosomal position (x axis). The grey line represents the threshold for genome-wide statistically significant associations (P = 5e−08). Red points represent significant hits, and each significant locus is annotated with the nearest gene.

Figure 2

Regional association and linkage disequilibrium plots for the (A) HLA-DQ* locus in relation to respiratory tract infections and the (B) ARHGAP15 locus in relation to abdominal infections. The y axis represents the negative log10 of the variant P values, and the x axis represents the position on the chromosome, with the name and location of genes shown in the bottom panel. The SNP with the lowest P value in the region is marked by a purple diamond. The colours of the other SNPs indicate the correlations of these SNPs with the lead SNP. Plots were generated with LocusZoom.

Abdominal infections

Twenty-six significant genetic variants were associated with abdominal infections (Table 2, Fig. 1A). The results for this phenotype were largely driven by ICD-10 code K57; intestinal diverticular disease and diverticulitis. A sensitivity analysis was performed where K57 was removed from the case definition of abdominal infections. With this updated definition of abdominal infections only three loci reached nominal significance (lead variants rs11428277, rs2049865, and rs377411728), while almost all loci reached P value < 5e−8 when tested for an association with K57 alone (data not shown). One locus (lead variant rs377411728) reached P value < 1e−5 for both K57 and the abdominal infection phenotype excluding K57.

The strongest hit was an intronic variant of the ARHGAP15 gene (chr2:rs6717024, P = 1.22e−34) (Fig. 2B). In an in vivo sepsis model, lack of ArhGAP15 (Rho GTPase-activating protein 15), which functions as a negative regulator of multiple neutrophil functions, induced cellular elongation but resulted in more efficient neutrophil migration, phagocytosis, and bacterial killing19. Based on these data, ARHGAP15 was suggested as a therapeutic target to enhance the antibacterial activity of white blood cells and decrease systemic inflammation in septic patients. CRISPLD2 (lead variant rs4782673, P = 1.37e−10), which is expressed in multiple tissues and leukocytes, has previously been associated with mortality in sepsis20. In a small case–control study, CRISPLD2 was reduced in patients with septic shock and showed a negative correlation with the bacterial infection biomarker procalcitonin21. In mice, administration of recombinant CRISPLD2 was protective against endotoxin shock, presumably through inhibition of the binding of bacterial lipopolysaccharide (LPS) endotoxins to target cells, and consequently reduced induction of TNF-α and IL-6 cytokine production22.

Variants in the ARHGAP15 locus were associated with diverticular disease in a GWAS using the UK Biobank cohort23 and in a cohort of Icelandic and Danish cases and controls24. In addition to ARHGAP15, most other variants, in or nearby genes SLC35F3, CALCB, COLQ, EFEMP1, LYPLAL1-DT, CRISPLD2, TRPS1, S100A10, ANO1, LINC01082, DISP2, CACNB2, BDNF, P2RY14, WDR70, ELN, FAM185A, ENPP2, ENTPD7, ABO (close to SURF6), PPP1R14A (close to SPINT2) and MIR2113, were also located close to SNPs previously associated with diverticular disease23. The protein encoded by COLQ (lead variant rs7609897, P = 9.76e−14) influences smooth muscle motility and the neuromuscular junctions between nerve cells and muscle cells, suggesting a biological function in the development of intestinal diverticula. Variants in the COX15 locus (lead intronic variant, chr10:rs11428277; P = 1.69e−11) were previously associated with colorectal cancer25 and Crohn’s disease26. The COX15 protein is localized in the inner mitochondrial membrane and has a key function in the electron transport chain. Bacterial invasion in the intestinal mucosa secondary to inflammation or cancer is a plausible biological explanation for the observed associations.

Moreover, several genetic variants were located near SNPs or genes of potential importance to susceptibility to other types of infections, the host immune defence and other intra-abdominal conditions. A SNP in the EFEMP1 locus (lead variant rs1802575, 3’UTR, P = 1.56e−11) was previously associated with a history of childhood ear infection8. Decreased expression of EFEMP1 (epidermal growth factor-containing fibulin-like extracellular matrix protein) in hepatocellular cancer cells is a predictor of tumour spread and metastasis, and consequently worse prognosis27. Interestingly, EFEMP1 acts by promoting SEMA3B, which belongs to the semaphorin family of proteins that regulate multiple physiologic processes including the immune response and cell migration. Reduced levels of SEMA3B in fibroblast-like synoviocytes were found in patients with rheumatoid arthritis, suggesting a role also in the development of autoinflammatory disease28. Variants in the SLC35F3 locus (lead variant, chr1:rs4333882; P = 2.14e−14) have been reported to be associated with levels of the pro-inflammatory cytokine IL-629. The biological function of SLC35F3 is unknown, but IL-6 has a key role in the acute phase response to infections by stimulating the production of neutrophils. SNPs in or nearby TRPS1 (lead variant, rs2049865, P = 4.67e−10) were previously associated with white blood cells and cytokines30, and MIR2113 (distance 128 kb from rs9372625, P = 2.02e−9) with the composition of the gut microbiota31.

Respiratory tract infections

Seven independent loci were associated with RTI phenotypes (all RTIs, n = 2; bacterial pneumonia, n = 4; influenza and viral pneumonia, n = 1) (Table 2, Fig. 1B). The strongest hit associated with the combined phenotype of all RTIs (chr6:rs28752520, P = 1.77e−10) (Fig. 2A), located in the HLA-DQA1 locus, was previously found to be associated with blood protein levels32. Other variants close to our index SNP, but not strongly correlated, have also shown associations with common infections: plantar warts (distance = 0 kb, r2 = 0.47 according to the GWAS catalog database), childhood ear infection (18 kb, r2 = 0.021) and scarlet fever (43 kb, r2 = 0.026)8. A significant variant on chromosome 9 (position 128,648,077, P = 1.94e−10), previously reported to be associated with sleep duration33, is located near PBX3; variants in this locus have shown association with squamous cell lung carcinoma34. The gene product, pre-B-cell leukaemia transcription factor 3, induced an inflammatory response in sepsis in a murine infection model by acting as a competing endogenous RNA for HMGB1 (high-mobility group protein 1)35. HMGB1 is produced by macrophages in response to bacterial infections, functioning as an endotoxin-induced cytokine mediator of inflammation, and has been proposed as a potential therapeutic target for sepsis36.

Previous studies of schizophrenia37, cigarette smoking, chronic pulmonary disease38 and lung cancer39 reported associations with SNPs that were adjacent (< 10 kb), but not strongly correlated (r2 ≤ 0.246), to one of the SNPs associated with bacterial pneumonia (lead variant, rs77438700, P = 7.72e−10). Nearby genes of interest include CHRNA3 and CHRNA5; variants in this locus are associated with chronic obstructive pulmonary disease and lung cancer40. CHRNA3 and CHRNA5 encode the alpha-type subunit of a cholinergic receptor, which likely mediates the effects of nicotine on the brain. Due to their association with nicotine dependence, the causal variants at this locus probably serve as a determinant of smoking behaviour, subsequently increasing the risks of chronic lung disease and bacterial pneumonia.

Sepsis

The GWAS revealed only four rare variants associated with sepsis (Table 2) and no significant correlations were found in the GWAS catalog database. Our findings should be interpreted with caution due to the low frequency and limited sample size of this phenotype (n = 4840). Nearby genes of interest include SELE and SELL, which encode the leukocyte cell adhesion receptors Selectin E and L that are involved in leukocyte/endothelium interactions during interleukin-induced inflammation. SELE is associated with Leukocyte Adhesion Deficiency (LAD), a rare autosomal recessive disorder typically presenting with recurrent severe bacterial infections41. Selectin L facilitates entry of lymphocytes into the extracellular space42, which is an integral process in the immune response to sepsis. Another associated rare variant (EAF cases = 0.0027, lead SNP chr4:rs564716204, P = 1.90e−08) was located nearby LRBA. LRBA (LPS responsive beige-like anchor protein) deficiency is an autosomal recessive genetic disorder caused by mutations resulting in reduced expression and function of the cytotoxic T lymphocyte-associated protein 4 (CTLA4)43. This condition is associated with low levels of immunoglobulins (IgG, IgM, IgA), repeated infections due to impaired humoral immune response, and increased risk of autoinflammatory diseases (e.g., diabetes mellitus, inflammatory bowel disease).

Other phenotypes

Multiple loci, most of which are novel in the context of infectious diseases, were found for the remaining phenotypes: gastroenteritis (n = 6), heart infections (n = 4), sexually transmitted diseases (n = 1), skin infections (n = 1), specified viral infections (n = 1), UTI (n = 5) and urogenital (non-UTI) infections (n = 2) (Table 2). A genetic variant associated with skin infections in our study (lead SNP chr5:rs6595799, P = 2.39e−08) is highly correlated (r2 ≥ 0.9) with, and close to, SNPs near LINC01184. LINC01184 is a long intergenic non-protein coding RNA that is differentially expressed in many types of cancer and has previously been reported to be associated with cancer44,45, blood cell traits46 and other phenotypes47. One of the strongest hits for heart infections in our study (chr14:rs182592259, P = 8.73e−09) was located near BDKRB2, which encodes the bradykinin B2 receptor that has a protective role in the development of hypertension and cardiovascular disease48, thereby potentially also affecting the vulnerability to infections.

Functional annotation

We identified a total of 91 colocalization events representing 4, 15 and 23 unique traits, tissues and genes respectively. PPP1R14A showed the strongest colocalization with abdominal infections, and colocalized with eQTL signals in both sigmoid and transverse colon tissue with lead variants in strong LD (r2 > 0.98; strongest association lead SNP rs4803934, SMR, P = 4.68e−10) (Table S5). Neighboring genes did not show a similar pattern of colocalization with the GWAS signal in this locus (Fig. 3). PPP1R14A, also known as CPI-17, belongs to the protein phosphatase 1 (PP1) inhibitor family which has a key role in the adjustment of smooth muscle contraction in response to physiological stimuli49. PPP1R14A has shown associations with different cancer types in prior large-scale GWASs, including colon and prostate cancer50,51. Evidence from these studies points to a transcriptionally mediated effect; imputed PPP1R14A expression, derived as a linear combination of cis genotypes associated with expression of the gene, showed association with prostate cancer in two independent cohorts51. This locus also shows association with diverticular disease23.

Figure 3

eQTL colocalization between GWAS signal for abdominal infections and PPP1R14A expression in the colon. Neighbouring genes do not show the same colocalization.

We observed that several colocalization events for abdominal disease occurred with genes in the HLA region; SMR and HEIDI analyses identified HLA-DQA2, HLA-DRB6 and HLA-DRB1. Due to the complex LD structure in the HLA region, there is likely to be additional pleiotropy occurring with these discoveries. Although we performed HLA region fine-mapping to detect more closely resolved association signals, we did not perform SMR and HEIDI analyses with the fine-mapped data. Colocalizations for HLA regions were observed in 11 distinct tissues including whole blood and spleen; many tissues likely share eQTLs that underlie these results. We also observed a colocalization between ABO expression and abdominal infections in the adipose visceral omentum (lead SNP rs505922; P = 6.91e−06). Associations between blood types and infections were amongst the earlier identified associations between molecular traits and phenotypes52. There is evidence that individuals with different blood types have varying levels of susceptibility to acute pyelonephritis53, presumably mediated by the expression of receptors in the endothelium, and abdominal infections54. Finally, we observed colocalizations between abdominal infections and the colon expression of NOV (also called CCN3, lead variant rs61100635, P = 1.75e−06) and DISP2 (rs2289328; P = 1.27e−06).

Heritability

The narrow-sense heritability on the observed scale was low (0–4%) for all phenotypes (Table S6) with the highest heritability found for abdominal infections. The difference in heritability could partly be related to the phenotype definitions; the phenotype of abdominal infections was more homogenous than RTIs, which included multiple infections of varying severity and pathogens. Moreover, the genetic component is likely higher in endogenous infections, such as abdominal infections, which are normally caused by bacteria of the host’s microbiome, compared to exogenous infections, including viral RTIs or gastroenteritis, which depend on exposure and acquisition of a transmittable pathogen.

Phenome-wide associations in the UK Biobank

The analysis of shared significant loci across infectious and non-infectious phenotypes in the UK Biobank cohort revealed associations between one of the SNPs identified for abdominal infections (rs570640158) located in the HLA region, and phenotypes related to infection, inflammatory and autoimmune diseases, including CRP, asthma (ICD-10 code J45), diabetes mellitus (E10, E11) and rheumatoid arthritis (M05, M06) (Fig. S3B). There were multiple shared SNP associations between abdominal infections and diverticular disease (ICD-10 code K57), as discussed above, and rs77438700 was associated both with bacterial pneumonia (ICD-10 code J18) and smoking.

Discussion

In this study, we explored genetic determinants of the susceptibility to phenotypes representing 18 bacterial and viral infection entities and identified 57 unique loci associated with at least one of the phenotypes. While many of the detected significant variants are novel in the context of infectious diseases, the same or strongly correlated SNPs, and nearby genes of potential relevance in the pathophysiology of infections, were frequently found in previous literature. Most SNPs detected for abdominal infections were located close to loci reported in a GWAS by Maguire et al.23 to be associated with diverticular disease and diverticulitis (ICD-10 code K57), which was also the main driver of results for this phenotype in our study.

As expected, some of the identified loci are associated with infectious diseases or components of the host immune defence against bacterial and viral infections, such as the HLA region. Our findings align with a previous GWAS in which genetic variants in the HLA region were associated with several self-reported infections (e.g., mononucleosis, mumps, pneumonia, and tuberculosis)8. Bacterial infections are typically associated with MHC-II genes, and viral infections with the MHC-I region, which is important for peptide recognition in CD8+ cytotoxic T cells. The HLA region is also associated with multiple immunological traits including selective IgA deficiency, the most common primary immunodeficiency in Europeans55, and autoimmune diseases such as rheumatoid arthritis56, systemic lupus erythematosus57 and ulcerative colitis58. Interestingly, one study showed that one of the genes and its products, HLA-DQA2, is often transferred from cancerous cells to normal cells via extracellular vesicles in malignant colon cancer59. The transfer of this and other genes resulted in neoplastic transformation in fibroblasts. Also, alleles in HLA-DRB1 have shown association with the composition of the gut microbiome56.

Genetic variants in the TRPS1 (rs2049865, P = 4.67e−10) and LINC01184 (rs6595799, P = 2.39e−08) loci, associated with abdominal infections and skin infections, respectively, showed associations with neutrophil and lymphocyte counts in a cohort of ~ 175,000 European-ancestry participants30. White blood cells are key components in the innate and adaptive immune responses to bacterial and viral infections60. Abdominal infections and gastroenteritis were associated with variants located in the SLC35F3 (lead SNP rs4333882, P = 2.14e−14) and FSTL5 loci (rare variant, EAF cases = 0.0014, lead SNP rs115809651, P = 8.07e−10), respectively. Although the biological functions of these genes are unknown, their associations with blood levels of cytokines (chemokines, interleukins, interferons)29 suggest potential importance for the innate immune response. Cytokines are key components in the biochemical pathways affecting migration and activation of white blood cells60 and are also fundamental in the biological processes of autoinflammatory diseases such as rheumatoid arthritis61 and inflammatory bowel diseases62.

Biologically plausible correlations were found between some of the infection phenotypes and chronic diseases, most frequently autoimmune diseases and cancer. While such co-morbidities increase the susceptibility for secondary infections, common genetic determinants that increase the risk for infections, inflammatory disease and malignancies could exist and be revealed either through studies of local genetic correlation or colocalization between traits. In this study, we observed colocalization of a variant associated with abdominal infections and gene expression in colon, suggesting causality of PPP1R14A in this class of infections.

This study has several strengths and limitations. To our knowledge, this is the largest interpreted GWAS to date on bacterial and viral infections using carefully determined compound phenotypes for important infection categories. External validation would have greatly added to the results but was not possible as other comparative data were unavailable. Replication using smaller biobanks with electronic health data would also be valuable to validate our findings. The definition of phenotypes based on specific diagnosis codes is a strength of this study, which is likely to increase sensitivity and specificity in relation to previous studies using self-reported history of diseases or ICD-10 codes without any curation. Still, some misclassifications are expected where the diagnosis set by the treating physician did not accurately describe the clinical syndrome; this situation may have resulted in false positive or negative cases and decreased the power of our analyses. It should be noted that there was sometimes an overlap in ICD-10 codes between phenotypes. As expected, there was some discrepancy in results between the combined phenotypes and subgroups (such as all RTIs vs. bacterial pneumonia). While the larger phenotypes are helpful to capture genetic variants related to the general systemic or local host immune defence, more specific phenotypes and larger cohorts may be required to find for example genetic determinants of pathogen-specific endothelial adhesion molecules. The conservative approach of refining the study cohort to correct for population structure and cryptic relatedness may have resulted in a lower estimated heritability. Further study is required to determine whether our observations result from genetic determinants affecting the risk for several disease groups or causal effects of co-morbidities that increase the vulnerability to infections.

Conclusions

In conclusion, we report multiple novel loci associated with bacterial and viral infections in a large population cohort and provide interpretation of these results in the context of previous literature. Our results add significantly to the limited existing data and biological insights in this field. The genetic determinants of infectious disease susceptibility identified in this study could potentially be used to help identify target genes for the development of novel therapeutics for prevention or treatment of these diseases.