Background

The gut microbiota is an enormous and complex ecosystem that is closely associated with the host, affecting metabolism, immunity and other physiological functions [1, 2]. Numerous studies have suggested a correlation between the gut microbiota and the incidence of complex diseases. A case–control study showed that the microbial pattern of women with breast cancer differs from that of healthy women in terms of bacterial type, relative abundance and function [3]. A cohort study of Indian children found that the proportion of Firmicutes was higher in children with autism spectrum disorder (ASD) than in healthy children [4]. In addition, the gut microbiota might be involved in the modulation of body mass index and blood lipid levels, according to the LifeLines-DEEP population cohort study of 893 subjects [5]. However, the mechanisms by which the gut microbiota contributes to many diseases remain unclear and require further research.

The composition of the gut microbiota is shaped by multiple factors, including environment, diet, medication and internal parameters [6]. In recent decades, a great deal of evidence has indicated that host genetic factors play an indispensable role in shaping gut microbial communities. Lim et al. found that monozygotic twin pairs had more similar gut microbial communities than other family members, and that 50 of the 85 gut microbial taxa identified (58.8%) showed significant heritability, with heritability estimates ranging between 13.1% and 45.7% [7]. Additionally, a study based on a large (n = 645) mouse advanced intercross line showed that microbial quantitative trait loci (mbQTLs) could significantly affect gut microbial taxa [8]. Moreover, microbial genome-wide association analysis (mGWAS) has been conducted in recent years to reveal loci related to the gut microbiota. According to a previous study, Lactococcus bacteria could be affected by the single nucleotide polymorphism (SNP) rs2294239 in the ZNRF3 gene, which is associated with body fat distribution [9].

The gut microbiota can be regarded as a trait affected by genetic factors [8]. Although GWAS has contributed a great number of genetic clues related to complex diseases and traits, on its own it is limited in explaining how genetic variants regulate gene expression, because the SNPs identified are mainly located in non-coding regions [10]. In recent years, expression quantitative trait loci (eQTLs) have been widely used to elucidate the influence of genetic variants at the gene expression level [11]. Subsequently, integrated analysis of GWAS and eQTLs became practical for exploring the effect of gene expression on complex traits [12]. One such family of methods is the transcriptome-wide association study (TWAS), which imputes expression from genetic data, shows great power to prioritize candidate genes for complex traits of interest, and has been used to identify associations between many diseases and genes [13]. For example, Liao et al. identified KAT2B and TMEM161B as causal genes for attention deficit hyperactivity disorder by TWAS [14]. Another TWAS detected 25 genes, including CELA3B, whose predicted expression was statistically significantly associated with pancreatic cancer risk [15]. To the best of our knowledge, no TWAS has been applied to the gut microbiota until now.

In this study, we performed TWAS analysis and fine mapping of gut microbiota for multiple tissues by leveraging expression imputation from large-scale GWAS data sets. Subsequently, functional analysis was conducted to explore the biological functions and pathways of the significant gene sets. Furthermore, we compiled the diseases associated with gut microbiota candidate genes by manually reviewing the literature.

Methods

mGWAS of gut microbiota

The human microbiota GWAS summary data were obtained from a study published by Hughes et al. [16]. The study comprised 2223 individuals from the Flemish Gut Flora Project (FGFP) cohort. DNA was extracted from frozen fecal samples and subsequently used for 16S ribosomal RNA gene sequencing. Among 499 taxon-derived abundances in FGFP, 92 taxa met the analysis criteria and were identified as independent phenotypes. The presence/absence (P/A) phenotype (binary) and the zero-truncated (all zero values set as missing) abundance (AB) phenotype (continuous) were generated for taxa where > 5% of individuals in FGFP had an abundance measurement of zero. Genome-wide genotyping of FGFP was conducted using either the Human Core Exome v.1.0 array or the Human Core Exome v.1.1 array. SNPTEST v.2.5.0 was used for association analysis. In brief, 157 microbial traits, comprising 62 presence/absence (P/A-HB) and 95 abundance (AB-RNT) microbial phenotypes, were included in the subsequent analysis. Detailed information on subjects, study design, statistical analysis and quality control can be found in the original publication [16].
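As an illustration of the phenotype definitions above, the following minimal Python sketch derives a binary P/A phenotype and a zero-truncated AB phenotype for a single taxon. It is not the pipeline of Hughes et al.; the function name `derive_phenotypes`, the parameter `zero_threshold` and the toy counts are hypothetical, and any subsequent transformation of the abundance values is deliberately left out.

```python
def derive_phenotypes(abundances, zero_threshold=0.05):
    """Split one taxon's abundances into P/A and zero-truncated AB phenotypes.

    Returns (presence_absence, zero_truncated). presence_absence is None when
    the share of zero measurements does not exceed zero_threshold, in which
    case the abundances are returned unchanged as a single continuous trait.
    """
    zero_fraction = sum(1 for a in abundances if a == 0) / len(abundances)
    if zero_fraction <= zero_threshold:
        return None, list(abundances)
    # Binary P/A phenotype: 1 if the taxon was observed, 0 otherwise.
    presence_absence = [1 if a > 0 else 0 for a in abundances]
    # Zero-truncated AB phenotype: zero values are set to missing (None).
    zero_truncated = [a if a > 0 else None for a in abundances]
    return presence_absence, zero_truncated


# Toy abundance values for ten individuals (made up for illustration only).
toy_counts = [0, 12, 0, 5, 30, 0, 7, 9, 0, 2]
pa, ab = derive_phenotypes(toy_counts)
```

In this toy example 40% of the measurements are zero, so both phenotypes are produced; a taxon observed in nearly every individual would instead be kept as one continuous trait.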

TWAS of gut microbiota

TWAS of gut microbiota was performed with the FUSION software, which uses gene expression weights precomputed for various tissues from a small set of individuals with both gene expression and genotype data. The cis-genetic component of expression is then imputed into much larger sets of phenotyped individuals according to SNP genotype data. In this study, we used the Bayesian Sparse Linear Mixed Model (BSLMM) to calculate the SNP expression weights in the 1-Mb cis locus of each gene [17]. Let w denote the weights, Z the mGWAS Z-scores of the SNPs for a gut microbiota trait, and L the SNP-correlation (LD) matrix. The test statistic for the association between predicted gene expression and each taxon was calculated as \(Z_{TWAS} = w'Z/\left(w'Lw\right)^{1/2}\). The imputed expression can be regarded as a linear model of genotypes, with weights based on the correlation between gene expression and SNPs in the training data and with linkage disequilibrium (LD) among SNPs taken into account [13]. Finally, the association between target traits and the expression level of genes was estimated by integrating the mGWAS summary data with the gene expression weights. The precomputed expression weights of tissues derived from the Genotype-Tissue Expression (GTEx) project were downloaded from the FUSION website (http://gusevlab.org/projects/fusion/). Specifically, in this study we used the sigmoid colon and transverse colon as reference panels. Following the recommendation for the FUSION software [13], we generated cleaned mGWAS summary statistics by leveraging the LD reference panel for further analyses; the mGWAS summary statistics had not been trimmed or thresholded beforehand. The percentage of SNPs in the LD reference that were available in the FGFP mGWAS data was approximately 13.8% for each microbial trait. We ran 2000 permutations for each FUSION analysis to reduce inflation from chance QTL co-localization. In this study, an analytical permutation P value (PPERM.ANL) < 0.05 was considered significant.
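The test statistic defined above can be illustrated with a short worked example. The sketch below is not FUSION code; it only evaluates the formula for one gene, assuming toy values for the weights w, the SNP Z-scores Z and the LD matrix L. The function names `twas_z` and `two_sided_p` are hypothetical, and the normal-approximation P value is a standard conversion added for illustration.

```python
import math


def twas_z(w, z, ld):
    """Evaluate Z_TWAS = w'Z / (w'Lw)^(1/2) for one gene."""
    numerator = sum(wi * zi for wi, zi in zip(w, z))                  # w'Z
    ld_w = [sum(lij * wj for lij, wj in zip(row, w)) for row in ld]   # Lw
    variance = sum(wi * vi for wi, vi in zip(w, ld_w))                # w'Lw
    return numerator / math.sqrt(variance)


def two_sided_p(z_score):
    """Two-sided P value of a Z-score under the standard normal distribution."""
    return math.erfc(abs(z_score) / math.sqrt(2))


# Toy inputs for a gene with three cis-SNPs (values made up for illustration).
w = [0.5, -0.2, 0.1]            # expression weights
z = [2.0, -1.0, 0.5]            # mGWAS Z-scores of the same SNPs
ld = [[1.0, 0.3, 0.0],
      [0.3, 1.0, 0.2],
      [0.0, 0.2, 1.0]]          # SNP-correlation matrix
z_twas = twas_z(w, z, ld)
p_twas = two_sided_p(z_twas)
```

Here w'Z = 1.25 and w'Lw = 0.232, so the statistic is 1.25 divided by the square root of 0.232. The denominator is what accounts for LD: correlated SNPs with same-sign weights inflate w'Lw and shrink the statistic accordingly.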

TWAS fine mapping

The fine-mapping of causal gene sets (FOCUS) approach was used to prioritize genes with strong evidence for causality in the TWAS analyses [18]. FOCUS integrates GWAS summary data and expression prediction weights estimated from the eQTL reference panel, accounts for the LD among all SNPs in the risk region, and estimates the probability that any given set of genes explains the TWAS signal, yielding a posterior inclusion probability (PIP) for each gene [18]. A gene included in the 90%-credible set is more likely to be causal than other genes in the region. Consistent with the TWAS analyses, the transverse colon and sigmoid colon were used as the reference panels in the FOCUS analysis. The threshold for screening of mGWAS summary data was 1 × 10⁻⁵ [16].
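To make the notion of a 90%-credible set concrete, the sketch below shows one common way such a set can be assembled from per-gene PIPs: rank the candidates by PIP and add them until the cumulative normalized probability reaches the chosen level. This is a simplified illustration rather than the FOCUS implementation; the function name `credible_set`, the gene labels and the PIP values are hypothetical, and treating a "NULL" (no causal gene) entry as one of the candidates is an assumption of this sketch.

```python
def credible_set(pips, level=0.9):
    """Return the smallest top-ranked set whose normalized PIPs sum to >= level."""
    total = sum(pips.values())
    ranked = sorted(pips.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for name, pip in ranked:
        chosen.append(name)
        cumulative += pip / total   # normalize so that all candidates sum to 1
        if cumulative >= level:
            break
    return chosen


# Toy PIPs for one risk region (values made up for illustration only).
toy_pips = {"GENE_A": 0.685, "GENE_B": 0.20, "NULL": 0.08, "GENE_C": 0.035}
cs90 = credible_set(toy_pips)
```

In this toy region the set contains GENE_A, GENE_B and the null entry, because the two top genes together reach only 0.885. A gene with a modest PIP can therefore still enter a credible set when no single candidate dominates the region.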

Functional analyses

The gut microbiota-related genes identified by TWAS (PPERM.ANL < 0.05) were used for functional analyses on the Functional Mapping and Annotation (FUMA) online platform [19]. P values were calculated by FUMA for each Gene Ontology (GO) term and pathway. An FDR-corrected P value < 0.05 was considered significant.
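The FDR threshold above can be illustrated with the Benjamini-Hochberg procedure, a widely used FDR correction. The sketch below is a minimal stand-alone version and is not FUMA code; that the platform applies this particular procedure is an assumption here, and the function name `benjamini_hochberg` and the toy P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy enrichment P values for five gene sets (made up for illustration only).
toy_p = [0.2, 0.001, 0.5, 0.01, 0.02]
fdr_p = benjamini_hochberg(toy_p)
significant = [i for i, q in enumerate(fdr_p) if q < 0.05]
```

With these toy values the three smallest P values remain below 0.05 after adjustment, while the gene set with a raw P value of 0.2 does not.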

Verification of gene and disease association

Literature mining was performed to compile lists of diseases related to the genes. PubMed (https://pubmed.ncbi.nlm.nih.gov/) was searched to determine whether the significant genes identified by TWAS for each taxon had been reported as causal genes for the target diseases.

Results

TWAS results

In total, TWAS of 157 microbial traits was performed with FUSION. For the presence/absence (P/A-HB) phenotype, 1693 genes were identified by TWAS across the 62 microbial traits (Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Table S3), such as TOB2P1 for Enterococcaceae in sigmoid colon (PPERM.ANL = 1.94 × 10⁻⁵⁰), KCNIP3 for Veillonellaceae in transverse colon (PPERM.ANL = 8.35 × 10⁻³³) and WDR6 for Coprococcus in sigmoid colon (PPERM.ANL = 1.1 × 10⁻¹⁶). Likewise, 2247 genes were detected across the 95 microbial traits of the abundance (AB-RNT) phenotype, such as WDR6 for Butyrivibrio in sigmoid colon (PPERM.ANL = 1.24 × 10⁻⁶⁴), FBXO41 for Clostridium XlVa in transverse colon (PPERM.ANL = 1.47 × 10⁻²¹) and CENPE for Veillonellaceae in sigmoid colon (PPERM.ANL = 2.30 × 10⁻¹⁷). Table 1 summarizes the top 20 significant genes associated with microbiota for each of the two phenotypes.

Table 1 Top 20 candidate genes detected by TWAS in P/A and AB models

We summarized the candidate genes shared by different microbial traits (Fig. 1, Additional file 4: Table S4), such as NDUFV3 for Lentisphaerae (HB), Bacteroidales (HB), Prevotella (HB), an unclassified genus of order Clostridiales (RNT), an unclassified genus of family Ruminococcaceae (RNT), Victivallis (HB), Bacteroides (RNT), Sporobacter (RNT), an unclassified genus of phylum Bacteroidetes (HB), Chao diversity (RNT) and the number of genera observed (RNT); and SFTPD for Rhodospirillaceae (HB), Alphaproteobacteria (HB), an unclassified genus of phylum Proteobacteria (HB), Rhodospirillales (HB) and an unclassified genus of family Rhodospirillaceae (HB). Table 2 shows the top 6 genes that recurred most often across microbial traits.

Fig. 1

Top 14 overlapped candidate genes with the most repetitions in all microbial traits. The Circos plot shows the top 14 candidate genes that recur most often across all gut microbiota traits in transverse colon and sigmoid colon. The associations of each OTU with multiple genes are also shown. The labels on the left of the figure represent gene names, and the labels on the right, sorted alphabetically, represent different OTUs.

Table 2 Top 6 overlapped candidate genes for different microbial traits

Fine mapping results

We performed fine mapping with FOCUS for the 157 microbial traits using the two reference panels, and found 11 genes included in 90%-credible sets, indicating that these genes may be causally associated with microbial traits (Table 3). Among them, 3 genes had also been identified in the TWAS analyses, in four gene-trait pairs: HELLS for Streptococcus (RNT) (PIP = 0.685) in sigmoid colon, HELLS for Streptococcaceae (RNT) (PIP = 0.665) in sigmoid colon, ANO7 for Erysipelotrichaceae (RNT) (PIP = 0.449) in sigmoid colon, and STAG3L4 for Lachnospiraceae (RNT) (PIP = 0.171) in transverse colon.

Table 3 Potentially causal genes for microbial traits detected by FOCUS

Functional analyses results

The significant genes identified by TWAS for each microbial trait in the two tissues were subjected to functional analysis (Additional file 7: Table S7). In total, we detected 94 GO terms across the two phenotypes. For instance, GO_NUCLEOSIDE_DIPHOSPHATASE_ACTIVITY was significant for Butyrivibrio (RNT) (FDR P = 1.30 × 10⁻⁴), GO_CONDENSED_CHROMOSOME_CENTROMERIC_REGION was significantly associated with Acidaminococcus (HB) (FDR P = 1.17 × 10⁻³), GO_SPECTRIN_BINDING was associated with Burkholderiales (RNT) (FDR P = 1.69 × 10⁻³), and GO_VACUOLE was associated with Enterobacteriaceae (RNT) (FDR P = 2.84 × 10⁻³).

FUMA also identified 11 pathways related to microbial traits, such as KEGG_RENIN_ANGIOTENSIN_SYSTEM for Anaerostipes (RNT) (FDR P = 3.16 × 10⁻²), KEGG_PURINE_METABOLISM for Veillonellaceae (HB) (FDR P = 7.35 × 10⁻³) and KEGG_JAK_STAT_SIGNALING_PATHWAY for Enterococcaceae (RNT) (FDR P = 2.60 × 10⁻²). Table 4 shows the top 10 Gene Ontology terms and KEGG pathways of the significant genes.

Table 4 Top 10 significant GO and KEGG pathways for microbial traits

Association between candidate genes and diseases

The top genes selected in Tables 1 and 2 were searched on the PubMed website to explore their possible relationships with diseases, and 12 genes were found to be associated with 12 diseases (Table 5). For instance, HELLS for Streptococcus in sigmoid colon was related to colorectal cancer [20], and SFTPD for an unclassified genus of Proteobacteria in transverse colon was related to atherosclerosis [21]. In addition, although not included in the top genes, FUT2 for Bifidobacterium was suggested to be a causal gene for Crohn's disease (CD) in a previous study [22].

Table 5 The list of candidate genes associated with diseases

Discussion

Host genes have been shown to be closely related to the ecosystem of the gut microbiota. Previous studies have detected multiple candidate genes associated with specific taxa [23,24,25]. Recent studies have indicated that noncoding regulatory regions play an important role in influencing human complex traits. The gut microbiota has been suggested to be a complex trait of the host affected by mbQTLs [8], so we speculate that the host can influence the composition of the gut microbiota and the abundance of specific groups by regulating gene expression. In this study, TWAS was performed to prioritize candidate genes affecting the gut microbiota at the gene expression level by integrating GWAS summary data with precomputed tissue-specific expression weights. We identified a number of genes and pathways related to microbial traits, and some of the genes have been reported in previous studies to be associated with specific diseases.

TWAS and fine mapping both prioritized several candidate genes for gut microbiota, such as HELLS for Streptococcus in sigmoid colon and ANO7 for Erysipelotrichaceae in sigmoid colon. We attempted to explore the relationship between gut microbiota candidate genes and diseases. HELLS encodes lymphoid-specific helicase, which participates in the establishment and maintenance of DNA methylation and in chromatin remodeling through its ATPase activity [20]. HELLS expression has been shown to be significantly associated with colorectal cancer progression and a higher pathological grade [20]. Aberrant bands of HELLS were observed in seven colorectal cancers by a polymerase chain reaction-based single-strand conformation polymorphism assay [26]. Streptococcus has been identified as a candidate pathogen for colorectal cancer in previous research [27, 28]. ANO7 has been found to play a central role in prostate cancer progression, and its elevated expression correlates with disease severity and outcome [29]. Notably, the abundance of Erysipelotrichaceae was observed to be increased in prostate cancer patients [30]. Men treated for prostate cancer with androgen axis-targeted therapy showed a significant decrease in the abundance of sequencing reads assigned to Erysipelotrichaceae [31]. In mice, the abundance of Erysipelotrichaceae in the gut microbiota also differed between cancer-bearing and healthy animals [32].

FUT2 was found to be associated with Bifidobacterium in transverse colon in the TWAS. The FUT2 gene encodes α-1,2-fucosyltransferase, which is required for the expression of ABH blood group antigens on mucosal surfaces and determines the ability to secrete blood group antigens into gastrointestinal secretions. Individuals who are homozygous for loss-of-function variants in FUT2 are non-secretors: ABH antigens are not expressed in their mucosal secretions or on their mucosal surfaces, and their genotype is generally denoted sese [33, 34]. Accordingly, the secretor genotypes are denoted SeSe and Sese [34].

Alterations of FUT2 genotype result in a significant shift in microbial composition, which has been described as the gardening effect of FUT2 polymorphism on the phylogenetic composition of the gut microbiota [34]. Existing studies consistently show a genome-wide significant association between the FUT2 non-secretor allele and CD in various ethnic groups [22, 35]. It has been suggested that homozygosity for the FUT2 loss-of-function allele changes the gut microbiota of CD patients [36,37,38,39]. FUT2 polymorphism may also partly contribute to CD susceptibility by shaping the community composition and structure of the microbiota [36, 37]. Previous studies showed that the genus Bifidobacterium had higher diversity, richness and abundance in secretors than in non-secretors [40, 41]. Moreover, an increase in the genus Bifidobacterium is related to a successful clinical outcome or remission after therapy in CD [42]. Further studies are warranted to identify the interactions between FUT2, Bifidobacterium and CD.

TWAS also identified SFTPD as a candidate gene for an unclassified genus of Proteobacteria in transverse colon. SFTPD encodes surfactant protein D, an important host defense lectin that aggregates microbes and dying host cells and enhances their phagocytosis [43]. SFTPD is mainly expressed in the lung, but is also found in the gallbladder and gut, and could shape the intestinal microbial ecosystem [43]. Some evidence points to a link between SFTPD and the phylum Proteobacteria. Nexoe et al. found a strong positive correlation between inflammatory activity and expression of SFTPD in the intestinal epithelium of inflammatory bowel disease (IBD) patients [44], while an increase in Proteobacteria is one of the most consistent observations in individuals with IBD [45].

SFTPD has been reported to exacerbate the development of atherosclerosis in previous literature [21, 46,47,48]. In recent decades, bacterial infections and chronic inflammation have emerged as possible causes of cardiovascular disease. Atherosclerosis is a chronic inflammatory process driven by lipids in the walls of the great arteries [49]. SFTPD has been shown to play a prominent role in pro-inflammatory processes [50, 51]. According to previous studies, genera of the phylum Proteobacteria are involved in the formation of atherosclerosis. For instance, Proteus vulgaris was found to be present in the plaques and intestines of the same individual [52], and Proteus mirabilis can interact with atherosclerotic plaques in human coronary arteries via specific molecules and exacerbate disease progression [53]. In addition, the abundance of Proteus in the blood of cardiovascular disease patients was observed to be higher than in healthy individuals [52]. In mouse disease models, a reduction in the abundance of the phylum Proteobacteria can exert a therapeutic effect on atherosclerosis [54]. Since our findings relate SFTPD to the abundance of bacteria from the phylum Proteobacteria, we hypothesize that the microbiota could affect susceptibility to atherosclerosis through genetic regulation.

KEGG_RENIN_ANGIOTENSIN_SYSTEM was found to be associated with Anaerostipes in the functional analysis. In a recent study, a lower abundance of Anaerostipes was observed in primary aldosteronism patients than in healthy individuals [55]. Bier et al. confirmed that a high-salt diet could decrease the abundance of taxa from the genus Anaerostipes [56]. Moreover, Anaerostipes was found to be correlated with a higher estimated glomerular filtration rate in the overall population [57].

To the best of our knowledge, we conducted the first large-scale, comprehensive, tissue-specific TWAS of gut microbiota in sigmoid colon and transverse colon, and performed fine mapping based on TWAS for further confirmation. The candidate genes for gut microbiota were further explored for links between various taxa and diseases. Our study also has three potential limitations. First, only individuals of European ancestry from Germany and Belgium were included in the analysis, so the results cannot be generalized to other ethnic groups. Second, information about the diet and drug use of individuals was lacking, so we cannot rule out effects of diet and medication on the composition of the gut microbiota. Third, it should be noted that the purpose of this study was to screen and prioritize candidate genes for gut microbiota, so the results should be interpreted with caution. At present, research on the interaction between genes and gut microbiota still needs more extensive exploration, and further functional studies should be performed to confirm our findings and elucidate the mechanisms by which genes act on the gut microbiota.

Conclusions

In conclusion, we performed TWAS analyses and identified multiple candidate genes and pathways for gut microbiota. We found that some candidate genes may also be involved in susceptibility to diseases, which provides clues to how genetic influences on the gut microbiota relate to the occurrence and development of diseases. Our findings may provide new insight into the influence of genetic factors on the composition of the gut microbiota and suggest a potential role of the gut microbiota in the mechanisms by which genetic factors contribute to disease susceptibility. Further studies are needed to demonstrate the specific biological mechanisms.