Host genetic variation impacts microbiome composition across human body sites
The composition of bacteria in and on the human body varies widely across human individuals, and has been associated with multiple health conditions. While microbial communities are influenced by environmental factors, some degree of genetic influence of the host on the microbiome is also expected. This study is part of an expanding effort to comprehensively profile the interactions between human genetic variation and the composition of this microbial ecosystem on a genome- and microbiome-wide scale.
Here, we jointly analyze the composition of the human microbiome and host genetic variation. By mining the shotgun metagenomic data from the Human Microbiome Project for host DNA reads, we gathered information on host genetic variation for 93 individuals for whom bacterial abundance data are also available. Using this dataset, we identify significant associations between host genetic variation and microbiome composition in 10 of the 15 body sites tested. These associations are driven by host genetic variation in immunity-related pathways, and are especially enriched in host genes that have been previously associated with microbiome-related complex diseases, such as inflammatory bowel disease and obesity-related disorders. Lastly, we show that host genomic regions associated with the microbiome have high levels of genetic differentiation among human populations, possibly indicating host genomic adaptation to environment-specific microbiomes.
Our results highlight the role of host genetic variation in shaping the composition of the human microbiome, and provide a starting point toward understanding the complex interaction between human genetics and the microbiome in the context of human evolution and disease.
Expression quantitative trait locus
Genome-wide association study
Human microbiome project
Inflammatory bowel disease
Operational taxonomic unit
Quantitative trait locus
Single nucleotide polymorphisms
Recent advances in high-throughput sequencing technologies have unveiled wide variability in the microbial communities that coat the human body [1, 2]. There are differences in the microbiota across body sites, which constitute distinct ecological niches [1, 3, 4]. Within each body site, the composition of the microbiome may change rapidly, but community features can remain constant for years [5, 6]. There is great variability in the microbiome across individuals, with some differences associated with chronic conditions, including obesity, diabetes, and inflammatory bowel disease (IBD) [7, 8, 9, 10, 11, 12]. Recent studies in germ-free animals have shown that these shifts in the microbiome can have an effect on host traits and could be causal in disease phenotypes [7, 12, 13, 14]. Therefore, understanding the factors that impact the composition of the microbiome in healthy individuals is critical to elucidate the role of the microbiome in disease and for development of therapeutics targeting the microbiome.
The composition of the human microbiome is influenced by multiple environmental factors. For example, changes in host diet affect gut microbiome communities at the taxonomic and functional level [5, 15]. In addition, intake of drugs and antibiotics can modulate the gut microbiome [16, 17]. Moreover, studies have shown variation in the gut microbiome can be controlled by interactions with pathogens and parasites [18, 19]. Lastly, social contact and interaction with the environment have also been implicated in shaping the microbial flora in the gut and skin [20, 21, 22].
Along with this clear evidence for the influence of environmental factors, there is also support for a host genetic component in structuring of human microbial communities . For example, single nucleotide polymorphisms (SNPs) in the MEFV gene are associated with changes in human gut bacterial community structure , and IBD-risk loci are associated with changes in gut microbiome composition . Researchers have also shown that a loss-of-function polymorphism in the gene FUT2, which is a known risk factor for Crohn’s disease, may modulate energy metabolism of the gut microbiome . Investigating individuals with inflammatory bowel disease, Knights et al. have shown that NOD2 risk allele count is correlated with an increase in the relative abundance of Enterobacteriaceae .
In addition to targeted and candidate gene approaches, researchers have also used host genome-wide genetic variation to find interactions with the microbiome. For example, in a recent study using 416 twin pairs to assess the heritability of the microbiome, Goodrich et al. identified microbial taxa for which relative abundance is more similar in monozygotic compared to dizygotic twins . In the laboratory mouse, quantitative trait locus (QTL)-mapping approaches have found multiple loci associated with gut microbial community composition, some of which overlap genes involved in immune response [28, 29]. Moreover, researchers have shown that host mitochondrial DNA haplogroups are correlated with the structure of microbiome communities . However, to date, there are no genome-wide studies that attempt to characterize specific genes and pathways in the human genome that shape the composition of the microbiome, although the value of such studies has often been suggested [31, 32].
Here, we performed a genome-wide analysis to identify human genes and pathways correlated with microbiome composition, using data generated by the Human Microbiome Project (HMP). In the last few years, the HMP has sampled and cataloged the microbial diversity across multiple body sites in a few hundred individuals . Since genotype data are not yet available for the individuals included in the HMP study, we extracted host genomic information from the ‘human contamination’ reads in the HMP shotgun metagenomic sequencing. This allowed us to generate genome-wide genetic variation data from 93 individuals, which we then tested for association with the microbiome profiles generated by the HMP.
Results and discussion
Mining the human microbiome project data for host reads
First, we scanned and identified the short reads in the metagenomic sequencing data that map to the human genome. By combining these reads across body sites (primarily originating from nares and cheek swabs ) for each individual (Additional file 1: Figure S1), we attained a mean depth of coverage of more than 10 reads per base pair per individual (Additional file 1: Figure S2). Combining all 93 individuals, the mean depth of coverage for each site is 1,061 reads (median 1,093), and 99 % of sites are covered at >500x summed across individuals. There is noticeable variability across individuals, although most individuals have a mean coverage in the range of 5x-20x (Additional file 1: Figure S3). We performed genotype calling on these individuals using stringent quality controls and filtering, and identified a final set of 4.2 million high-quality and informative single nucleotide polymorphisms (SNPs), of which 92 % were previously known and found in dbSNP, and were used in subsequent analyses (Additional file 1: Figures S1 to S10). The number of SNPs we identified is in line with previous reports using whole-genome sequencing in humans .
Correlation between host genetic variation and microbiome composition
This dataset also allows us to compare between-individual differences in the microbiome and host genetic variation. We correlated microbial beta diversity (that is, between-sample diversity) at each body site with genome-wide identity-by-state, a statistic estimating similarity in genome sequence between pairs of individuals. We found that identity-by-state is significantly negatively correlated with beta diversity in 10 of the 15 body sites (Additional file 1: Figure S13), including in the stool (Fig. 1c, R2 = 0.19, P <1015), anterior nares, hard palate, palatine tonsils, saliva, supragingival plaque, throat, and tongue dorsum (P <0.01 in each of the 10 body sites). These results indicate that the similarity in genome sequence is positively correlated with microbiome similarity, supporting a relationship between host genetic variation and the microbiome at a large scale. However, this pattern may be partly driven by population stratification, or non-genetic environmental factors that are correlated with genetic ancestry. For example, previous studies have found differences in the gut microbiome between human populations [36, 37], so geographic stratification could drive a biologically non-causal correlation between genetic ancestry and local diet, and thus with gut microbial composition.
Host genes and pathways correlated with microbiome composition
In an effort to control for population structure, in addition to other non-genetic factors that may be driving spurious correlations, we analyzed the data using a linear mixed model. The additive effects model included as covariates possible confounders, such as gender, sample collection location, sequencing center, and the first five coordinates from the MDS analysis of the host genotypic data. By including these covariates we are attempting to correct for effects of individual ancestry and extrinsic factors on the microbiome. We note that there are additional potential confounding factors that we could not account for in our model; for example, physical interaction between individuals, which has been shown to affect microbiome composition in primates , is not included, as these data were not collected by the HMP. We ran this model genome-wide, correlating host genetic variation in each SNP with the first five PCs of the microbiome in each of the 15 body sites. In addition to controlling for confounders, this genome-wide approach also allows us to identify specific loci in the host genome that are correlated with the microbiome, and understand their likely functional effect in the host. We recognize at the outset that our sample size is an order of magnitude smaller than most genome-wide association studies (GWAS), precluding us from being able to perform a standard test of association between microbiome composition and each SNP. Therefore, instead, we used a pathway-based analysis, whereby we aggregated SNPs into pathways in order to learn about the biological functions and processes that underlie interactions between host genome and the microbiome. We note that this is a common analysis approach for genome-wide association data, driven by the rationale that complex traits are controlled by multiple genetic effects, which could originate in different genes, but are likely to aggregate in the same biological pathway or function. The approach is aiming to identify these functions by looking for enrichments of biological functional categories among a set of associated genetic loci. Specifically, we first aggregated SNPs that were correlated with at least one microbiome PC at an arbitrary nominal cutoff of P ≤10−6 (using several other P value thresholds did not change the results; see Additional file 2: Tables S1 and S2). We then identified overlapping or nearby genes, and used these gene sets to perform a functional enrichment analysis.
Using this approach, we found the most significant enrichment with genes involved in pathway Leptin Signaling in Obesity (P = 2.29 × 10−7, Additional file 2: Table S1). Leptin is a hormone whose structure places it in the cytokine superfamily. It has been linked to the microbiome in several recent studies, mainly using leptin-deficient ob/ob mice [13, 38]. Leptin has several important roles in immunity, including activation of monocytes, neutrophils, and macrophages, and modulation of inflammation . Leptin may also impact the microbiome indirectly in its role as a hormone, whereby it regulates appetite and body weight, affects basal metabolism, and regulates insulin secretion, among other functions . The enrichment identified here is driven by significant correlations of host genetic variation with microbiome PCs in the nose, oral cavity, and skin (see Additional file 2: Table S1). Studies have shown that the leptin is expressed and has a functional role in the mouth . Leptin and leptin receptor are expressed in the skin , and may have a functional role in wound healing and psoriasis [42, 43]. Moreover, leptin is expressed in nasal polyps, and may affect the expression of mucin genes in polyp epithelial cells . Nevertheless, the role of leptin in interactions with microbial flora in these body sites is still not well understood.
In addition to leptin signaling, several other immunity-related pathways are enriched among microbiome-correlated host genes, such as Melatonin Signaling, JAK/Stat Signaling, Chemokine Signaling, CXCR4 Signaling, and Role of Pattern Recognition Receptors in Recognition of Bacteria and Viruses (Additional file 2: Tables S1 and S2). To further investigate the role of host genetic variation in immunity-related genes on the microbiome, we used the InnateDB database, and identified additional enriched pathways, including Interleukin-12-Mediated Signaling Pathway, GABAA Receptor Activation, Inositol Phosphate Metabolism, IL2, CXCR4-Mediated Signaling Events, and GnRH Signaling Pathway (Additional file 2: Tables S3 and S4). In addition, we found enrichment of genes in the REACTOME pathway Sulfide Oxidation to Sulfate, suggesting a potential role for host genetic variation in genes determining sulfate abundance in controlling microbial composition. We also found enrichment in the KEGG pathway Primary Bile Acid Biosynthesis. Recent studies have shown that the microbiome can modulate bile acid metabolism , and our results support a possible role for host genetic variation in bile acid metabolic pathways in interacting with the microbiota.
We used a similar approach to identify enrichment of SNPs annotated as expression quantitative trait loci (eQTLs) among the sites we found to be correlated with microbiome composition (Fig. 2b). We found an enrichment of eQTLs in several tissues that were identified in the GTEx project . This result indicates that the loci we identified in our analysis as correlated with microbiome composition are likely to have a functional role in regulating gene expression. Lastly, we sought to validate our results using an independent cohort. We followed a similar approach to identify correlations between GI tract microbiome PCs and host genetic variation in 984 individuals from the TwinsUK project cohort [14, 50]. We find an enrichment of SNPs correlated with microbiome composition in both studies (Fig. 2c; P = 0.028 using Fisher’s exact test for significant overlap between the two sets of SNPs). When considering genes located nearby correlated SNPs, the enrichment becomes more prominent; possibly indicating that different SNPs may control similar microbiome-linked genes and pathways.
Host genetic variation correlated with bacterial taxa
In addition to identifying interactions with the overall structure of the microbiome, we were interested in finding correlations between host genetic variation and specific bacterial taxa. To do so, we tested for correlation between genetic variation and relative abundances of bacteria derived from the HMP 16S rRNA gene sequences. Abundance data from HMP OTUs were parsed, extensively filtered, normalized, and taxonomically collapsed, to achieve a single representation for each taxon at the genus level or above (see Additional file 1: Figures S14-S19 and Additional file 3). After filtering inter-correlated taxa, our final dataset included 615 microbiome abundance traits in 15 body sites. In an effort to reduce the number of statistical tests, we included in the analysis only host SNPs located within protein-coding sequences.
Using pathway enrichment approaches described above, we found that genes linked to abundance of bacterial taxa are over-represented with relevant diseases (Additional file 2: Table S6), including transendothelial migration of lymphocytes, meningitis, and several cancer categories, including gastrointestinal adenocarcinoma, growth of mammary tumor, head and neck tumor, and thyroid cancer. To further visualize the interactions between these genes, we used the Ingenuity Pathway Analysis knowledgebase, which holds curated information on molecular pathways and protein interactions, and identified several networks significantly enriched with genes correlated with bacterial taxa abundances (Additional file 2: Table S7). Figure 3c displays the highest-scoring network, containing genes involved with cellular movement, hematological system development and function, and immune cell trafficking.
These results suggest that host genetic variation that is linked to microbial variation is enriched with sites that evolve under differential selection pressures across human populations. This is consistent with the notion of local adaptations to population-specific microbiomes, possibly controlled by environmental conditions for each population. Given that genes that we found to be linked to microbiome composition are enriched with immunity-related genes and pathways, this result may not be surprising; indeed, genetic variation in immune genes has long been associated with higher rated of positive selection in human populations . However, these selective pressures were hypothesized to be mainly a result of interaction with pathogens. Our results indicate that selection pressures on immunity genes and pathways may also be due to interaction with commensal microbial communities that accompany changing environments. Another potential explanation for this pattern is that past selection pressures against pathogens have driven changes in immunity genes that affect the commensal microbiome as a byproduct. Although distinguishing between these hypotheses is not possible using currently available data, the end result – commensal microbial traits affected by past selection events on host genes – is an exciting finding that we hope would be explored further in the future.
We describe an analysis of host genetic variation data mined from the metagenomic shotgun sequencing performed by the Human Microbiome Project. The ability to mine host genetic material from metagenomic shotgun sequence data has recently raised several privacy concerns . We note that in the current study, informed consent for sequencing of host DNA was given by the participants, although this is not a common procedure for metagenomics studies. We show here that it is possible to reconstruct complete host genomes using metagenomic sequence data, which is potentially identifiable. However, this was possible due to the unique study design of the HMP, whereby multiple body sites from each individual were sequenced at a high depth, allowing us to pool data across body sites and reach a 10x mean coverage per host genome. Common metagenomic shotgun sequencing studies, which usually include an order of magnitude less sequence data, are unlikely to enable such an analysis. Moreover, the majority of studies sequence stool samples, which include many fewer host-derived reads. Nevertheless, we anticipate that future shotgun metagenomics sequencing studies would consider these potential privacy concerns.
The analysis described in this paper focused on the taxonomic structure of the microbiome. However, it would be interesting to incorporate the functional composition of the microbiome when considering associations with host genetic variation. Indeed, several studies have highlighted the importance of shotgun metagenomics for uncovering the genic composition and metabolic capacity of the microbiome [1, 48]. A similar analysis would be critical to uncover functional interactions that could not be detected by looking at community and taxonomic composition. In addition, there are several environmental factors that could influence the microbiome, such as diet, which were not included in our analysis. We expect that the inclusion of such potential confounders in future studies would help to further disentangle the effects of environment and host genetic variation on the microbiome.
Our analysis has shown that host genetic variation in immunity-related pathways is correlated with microbiome composition. These results are consistent with recent reports of host immunity involvement in modulating microbiome structure, for example through production of antimicrobial compounds  or inflammation . Additionally, many recent studies have shown that a mice with a knocked-out immune gene display dramatic changes in their microbiota [57, 58, 59, 60]. Moreover, genetic variation in immune genes in the mouse was found to be correlated with the composition of the microbiome . In addition, our results show that the host variants and genes that are correlated with the structure of the microbiome are enriched in genes associated with complex disease that have been linked to the microbiome. This result is not surprising, considering that recent studies in the mouse have shown that microbiome QTLs overlap complex disease-linked genes [28, 29]. Taken together, these findings motivate the need for larger association studies to characterize host genetic variation linked to the microbiome in the context of various health conditions, environmental effects, and genetic backgrounds. Moreover, functional studies, for example using cells or animal models, would be crucial for elucidating the causal mechanisms whereby human genetic variation impacts the microbiome.
Materials and methods
A full and detailed description of the Methods is available in the Additional file 3 document.
Recruitment protocols were approved by Institutional Review Boards at each HMP clinical site, and written informed consent was obtained from all study participants for data sharing through dbGap. All study participants have consented for the sequencing of their own genetic material . Specifically, the HMP human subjects study was reviewed by the Institutional Review Boards (IRBs) at each sampling site: the BCM (IRB protocols H-22895 (IRB no. 00001021) and H-22035 (IRB no. 00002649)); Washington University School of Medicine (IRB protocol HMP-07-001 (IRB no. 201105198)); and St Louis University (IRB no. 15778). The study was also reviewed by the J. Craig Venter Institute under IRB protocol 2008–084 (IRB no. 00003721), and at the Broad Institute of MIT and Harvard the study was determined to be exempt from IRB review.
Host read data acquisition, filtering, and alignment
The processing of the raw data files through the genotyping step was performed on the compute cluster at the Broad Institute. We downloaded 1,553 raw Illumina read files (total of 8 TB) in SRA format, representing samples from 98 individuals (HMP subjects), from the dbGaP database. The files were decrypted, and converted to FASTQ format using NCBI’s SRA toolkit (version 1.0.0-b10) with default parameters. A total of 152 files that failed the standard Illumina quality checks were excluded from the downstream analysis. The reads from the remaining 1,401 files were aligned to the human genome (build hg19) using BWA v0.5.7  with default settings for the alignment, except for the ‘bwa sampe’ step, where the option ‘-a 2000’ was used to change the maximum insert size from default 500 to 2,000. Out of the 79,877,504,468 post-filter reads, 35,828,514,379 were mapped to the human genome. The 1,401 BAM files were reorganized by merging reads from different samples from the same subject into subject BAM files using samtools . The merging failed for one individual (due to corruption of the original sample BAM files), and for four others the merged BAM files contained only reads from stools samples very little human DNA present. These five subjects were excluded, leaving 93 individuals. The average number of mapped reads per individual was 365 million.
Genotype calling, filtering, and QC
Variants (SNPs and short indels) were called from all 93 cleaned and re-aligned BAM files using the GATK’s UnifiedGenotyper function with standard emission confidence parameter set to 3.0 (−stand_emit_conf 3.0). This value, much lower than the GATK default, was used in order to provide an exhaustive list of possible variants for subsequent filtering. The coverage for each individual was down-sampled to 200 (that is, the option –dcov 200 was used). Other options of UnifiedGenotyper were kept at their default values. The calculation was parallelized over genomic coordinates by splitting the genome into 80,000 bp intervals and running UnifiedGenotyper for each of these intervals on a separate processor of the compute cluster. After excluding contigs that did not map to a known chromosome, this unfiltered, low-pass genotype set included 19,377,382 SNPs and 3,519,487 short InDels. In order to filter the genotype calls and keep only high-quality variants, we used GATK and applied several hard filters that are recommended for low-coverage whole-genome data . Specifically, we excluded SNPs with low mapping quality, SNPs with a strand bias, and SNPs that are otherwise of low quality. In addition, we masked out SNPs that are near InDels using a window size of 10. Lastly, we excluded any SNPs for which there is missing information and a clear filter decision could not be made.
Next, we performed variant score recalibration on the SNPs that have passed the above filters using the GATK VariantRecalibrator. As input to train the model, we used three input SNP sets: (1) HapMap3.3 SNPs; (2) dbSNP build 132 SNPs; and (3) 1000 Genomes Project SNPs from Omni 2.5 chip. After applying the recalibration using the GATK ApplyRecalibration command and excluding variants that did not pass the various filters, we were left with 13,190,940 SNPs across the 93 individuals. Of this set, 7,229,492 SNPs (60.3 %) were also found in dbSNP. As quality control, we plotted the number of sites filtered out by each filter or combination of filters, as well as the Ti/Tv ratio for each filter combination (Additional file 1: Figure S5). The sites that passed our filtering criteria have the highest Ti/Tv ratio (mean 2.1), which is close to the expected value observed in many sequencing projects, including the 1000 Genomes Project pilot data (genomic average Ti/Tv of 1.96) . When we consider the frequency spectrum of alleles in our sample (Additional file 1: Figure S6), we see an enrichment of low-frequency variants, as consistent with many recent population-scale sequencing studies . We see a similar distribution when we consider allele sharing across individuals (Additional file 1: Figures S8 and S9), with most alleles appearing in only one individual. Since alleles at lower frequencies are less informative for association analysis, we excluded from downstream analysis SNPs that are at frequency of less than 5 % in our sample, leaving us a set of 5,536,004 SNPs. Of this set, 5,108,016 SNPs are also found in dbSNP (92.3 %). We further filtered this set keeping only SNPs with minor allele frequency above 10 %, SNP with P value >10−3 for Hardy-Weinberg equilibrium, autosomal SNPs, and SNPs with less than 50 % missing information. The final set included 4,205,323 SNPs that set that passed these QC thresholds and were used in the analysis. Pairwise identity-by-state (IBS) distances between individuals were calculated from the filtered SNP data using PLINK [67, 68]. We performed metric multidimensional scaling analysis (MDS) on the pairwise IBS distance matrix using PLINK.
Correlation and enrichment analysis
Where λ was calculated using the function box.cox.powers in R (in the package ‘car’). Correlation analysis of normalized trait values was performed in PLINK v1.07 , and included the following covariates: (1) Individual sex (binary variable); (2) Individual age; (3) Site where microbiome data were collected; (4) Center where sequencing was performed (this was coded as binary variables representing the four collection centers: BCM (Baylor College of Medicine), BI (Broad Institute), JCVI (J. Craig Venter Institute), and WUGC (Washington University Genome Center); (5) The total number of sequences for each individual in the metagenomic sequencing data; and (6) The positions on the first five dimensions in the MDS analysis of the genotype data. In addition to the microbiome PCs, we also ran a similar correlation analysis for a set of microbiome taxa, following a comprehensive filtering of the 16S OTU data as described in Additional file 3. To reduce the multiple test burden, this analysis was performed on a set of protein-coding host SNPs, which were identified after annotation of the SNP data using ANNOVAR , and included 33,814 protein-coding SNPs.
We considered SNPs correlated with microbiome PCs with P value ≤10−6, and identified genes that overlap or are located ≤50 kb from these SNPs, using data for all known human genes taken from the refGene table (hg19 genome build). The identified genes were used as input to functional enrichment analysis, performed using Ingenuity Pathway Analysis (IPA; August 2012 software release), a program that uses Ingenuity’s high-quality knowledge base, which includes curated information on genes, pathways, and interactions (see ). IPA generates a P value using a Fisher’s exact test comparing the expected and observed genes in a given pathway. The most enriched canonical pathways are listed in Additional file 2: Table S1. To identify the bacterial taxa driving these enrichments, we calculated correlations between each OTU and the PCs in each body site. The most highly correlated OTU for each PC where correlation with host genetics was found is listed in Additional file 2: Table S1. We also used the InnateDB database  to identify enrichment of specific gene ontology (GO; [72, 73]) categories (Additional file 2: Table S3) and additional pathway databases (Additional file 2: Table S4), including KEGG   and Reactome . To make sure the specific cutoff values chosen in this analysis do not affect the enrichment result, we repeated this analysis with varying P value and gene distance cutoffs (see Additional file 2: Table S2). Specifically, we used two P value cutoffs (P ≤10−6 and P ≤5×10−7) and three gene distance cutoffs (D ≤50 k, D ≤20 k, and D ≤5 k), and examined the enrichment P value and rank of pathways of interest (Additional file 2: Table S2).
The data from the GWAS Catalog  and the GTEx consortium  presented in Fig. 2 were downloaded from  in June 2013, and the GTEx portal  on October 2013, respectively. The enrichment plots shown in Fig. 2 were calculated as follows: given a dataset (for example, GWAS catalog genes involved in obesity-related traits), and given a P value cutoff (Pi, shown on the x-axis of the figure), we identified the set of genes or SNPs for which P ≤Pi. Next, we calculated the overlap between Gi and the genes or SNPs identified to be correlated with the microbiome in the current paper. The fold enrichment (y-axis) for Pi is the number observed compared to expected overlapping genes or SNPs, where the expected number is the overlap among genes or SNPs not in Gi.
To identify enrichment in an independent cohort, we used data from the TwinsUK Project, which included both stool microbiome 16S data, as well as host genetic data assessed by SNP genotyping, from 984 adults . OTU tables and PCs were generated using the QIIME pipeline as described above [79, 80, 81, 82]. Host SNP genotyping data were fully imputed using IMPUTE version 2, and quality checked as previously described . SNPs were removed if they had a minor allele frequency below 5 %, a genotyping rate below 95 % or extreme deviation from HWE (P <0.001). Deviation from HWE was determined using the genotypes from only a single twin from each twin pair. Only imputed SNPs with an imputation accuracy score (IMPUTE INFO field) greater than 0.9 were included in the analysis. The final number of SNPs used for the association analysis was 1,310,141. To test for correlation between host SNPs and fecal microbiome PCs, we used the score test implemented in the software Merlin  to account for the relatedness of the individuals (option –fastassoc). The recombination rates from HapMap II release 22 were used as the genetic map input to Merlin. Model covariates included the number of sequences per sample, sample batch, sequencing run, the person that extracted the DNA, the gender, the age, and the first three PCs of the MDS. After quality filtering of traits and genotypes, 170 MZ twin pairs, 241 DZ twin pairs, and 162 unrelated individuals were included in the association analysis. For the analysis shown in Fig. 2c, we used correlation P values for SNPs and nearby genes, and calculated fold-enrichment for several P values as described above.
We used FST data downloaded from the database of recent positive selection across human populations  via  in March 2014. We compared FST values in SNPs that were correlated with microbiome PCs with P <10−4 in each of the four body sites and the rest of the SNPs in our sample. To compare two sets of FST values we used a permutation test on the medians as follows: we randomly split the data into two groups the same size of the two original groups, and calculated the difference in medians between the two groups. This process was repeated 10,000 times, and the P value was defined as the proportion of permutations in which difference in medians was greater that the real difference between the two original groups. Figure 4 shows all the comparisons made and highlights in color cases where the calculated P value was smaller than 10−3. The error bars in the figure are 95 % confidence intervals that were calculated using bootstrapping as follows: for a given set of FST values, we subsampled with replacement a sample of the same size, and calculated the median of the sample. This was repeated 10,000 times, with the median recorded in each iteration. The 95 % CI was defined as the range between the 2.5 and 97.5 percentiles of all subsample medians.
16S rRNA gene sequence data and OTU tables are available on the HMP DACC website . Host genetic data are deposited in dbGaP under project number phs000228.
We thank the members of the Blekhman, Clark, Ley, and Keinan labs for discussions; O. Koren, T. Connallon, A. Early, L. Ma, and C. Van Hout for comments on the manuscript; and the Human Microbiome Project for the publically available data analyzed in this study. The work was supported by grant R01 DK093595 to REL and AGC. DG and KH were supported by a grant from the NIH (NIH U54 HG004969). The TwinsUK study was funded by the Wellcome Trust; European Community’s Seventh Framework Programme (FP7/2007-2013). The TwinsUK study also receives support from the National Institute for Health Research (NIHR) BioResource Clinical Research Facility and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. TDS is holder of an ERC Advanced Principal Investigator award. SNP genotyping of TwinsUK individuals was performed by The Wellcome Trust Sanger Institute and National Eye Institute via NIH/CIDR. This work was carried out in part using computing resources at the University of Minnesota Supercomputing Institute.
- 18.Morton ER, Lynch J, Froment A, Lafosse S, Heyer E, Przeworski M, et al. Variation in rural African gut microbiomes is strongly shaped by parasitism and diet. bioRxiv 2015. DOI: 10.1101/016949
- 20.Tung J, Barreiro LB, Burns MB, Grenier JC, Lynch J, Grieneisen LE, et al. Social networks predict gut microbiome composition in wild baboons. Elife. 2015;4Google Scholar
- 68.PLINK: Whole genome association analysis toolset. Available at: http://pngu.mgh.harvard.edu/~purcell/plink.
- 70.Ingenuity Pathway Analysis. Available at: www.ingenuity.com.
- 71.Innate DB. Available at: www.innatedb.com
- 73.Gene Ontology Consortium. Available at: www.geneontology.org.
- 74.KEGG: Kyoto Encyclopedia of Genes and Genomes. Available at: www.genome.jp/kegg.
- 76.Reactome. Available at: www.reactome.org.
- 77.National Human Genome Research Institute. Available at: www.genome.gov/26525384.
- 78.GTex Portal. Available at: www.broadinstitute.org/gtex/.
- 86.dbPSHP: A database of recent positive selection across human populations. Available at: http://jjwanglab.org/dbpshp.
- 87.NIH Human Microbiome Project. Available at: http://hmpdacc.org.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.