Introduction

The human microbiome has a role in human disease and overall health outcomes1,2. Individual microbiome profiles are unique, although many species are shared2,3. Knowledge of the relationship between the human microbiome and disease may serve as a component of future comprehensive individualized treatment plans4. Studies of the microbiome have typically involved 16S rRNA gene sequencing5, with metagenomic sequencing emerging more recently6.

Relevance of the lung microbiome has been demonstrated in the context of lung diseases7,8,9, including chronic obstructive pulmonary disease (COPD), asthma and idiopathic pulmonary fibrosis (IPF)10,11,12,13,14. In addition, the microbiome has been assessed in healthy lung and COPD exacerbations15,16. These studies have involved both lung tissue17,18 and airway sampling19,20,21,22,23, with some researchers integrating the microbiome data with host gene expression13,14,17,18,21. Study of the respiratory microbiome presents many challenges24, including the low microbial biomass available in the samples25.

It has historically been believed that peripheral blood does not contain bacteria unless an acute infection is present. Through use of culture-independent sequencing methods, evidence has emerged regarding a possible healthy human blood microbiome26,27,28,29,30. Culture-independent methods in microbiome studies do not provide evidence of whether a blood microbial signature is from transient nucleic acids or from live bacteria31. A blood microbial signature has been found to correlate with host disease traits in schizophrenia30, type 2 diabetes28, chronic kidney disease32 and liver fibrosis33, and it may provide a link in other tissues and diseases. Use of the microbiome for disease diagnosis and prediction has proven successful in cancer34. As with the lung microbiome, low biomass is an issue for peripheral blood microbiome studies35. Sequencing of both RNA27,30 and the 16S rRNA gene28,32,33 has been used to study the peripheral blood microbial signature.

In this study, we detected microbial signatures through secondary use of whole blood RNA-sequencing data from large subsets of the COPDGene (Genetic Epidemiology of COPD) study, repurposing sequencing reads not mapped to the human genome27,30,36. An overarching challenge in population-based microbiome studies relates to statistical power, as testing for associations between the detected microbial profiles and variables of interest places demands on sample size. Though samples were not collected as part of a traditional microbiome study, by using a large population and a meta-analysis approach, we had enhanced power to enable findings in the blood, with its typically lower microbial signals. Using statistical tools developed for microbiome analysis, we tested associations between the identified taxa and multiple COPD-related phenotypes available in COPDGene. We used network methods to integrate the microbiome signatures with the human gene expression data to highlight microbial interactions with host pathways. Our goal was to reveal microbial signatures in peripheral blood associated with lung relevant host factors and to observe lung biology relevance. A blood microbiome signature has the potential to serve as a biomarker of disease severity and progression and may inform personalized diagnostic or treatment efforts.

Methods

Study subjects

COPDGene is a longitudinal cohort study that includes non-Hispanic White and African American subjects enrolled at 21 centers across the United States37. All subjects in this study provided written consent for study procedures, including genetic analysis. COPDGene was approved by the Institutional Review Boards at all participating centers. The subjects include more than 10,000 current and former cigarette smokers with a minimum 10 pack-years smoking history, along with a small number of non-smokers. COPD cases have airflow obstruction (FEV1/FVC < 0.7), Preserved Ratio Impaired Spirometry (PRISm) cases have preserved ratio (FEV1 < 80% predicted with FEV1/FVC ≥ 0.7)38 and control subjects have normal spirometry (FEV1% predicted ≥ 80% and FEV1/FVC ≥ 0.7). The five-year follow-up visit included questionnaires, pre- and post-bronchodilator spirometry, volumetric computed tomography (CT) of the chest, and blood drawn for complete blood cell count, RNA-sequencing and biomarker studies. Subjects were at least one month removed from any exacerbation event or acute respiratory infection. Exacerbations were defined by use of antibiotics and/or systemic steroids, and severe exacerbations by emergency department visit or hospital admission39. Details of the RNA-sequencing methods are available in the online supplement40. We performed meta-analyses using a primary set of data and a second independent set of replication data from the COPDGene study.
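The spirometric group definitions above can be expressed as a small decision rule. The following is a minimal sketch rather than the COPDGene implementation; the function name and argument names are hypothetical, and only the thresholds are taken from the text.

```python
def classify_spirometry(fev1_fvc: float, fev1_pct_predicted: float) -> str:
    """Assign a spirometric group using the thresholds stated in the text.

    fev1_fvc: FEV1/FVC ratio (e.g. 0.65).
    fev1_pct_predicted: FEV1 as a percentage of the predicted value (e.g. 72.0).
    """
    if fev1_fvc < 0.7:
        # Airflow obstruction defines a COPD case.
        return "COPD"
    if fev1_pct_predicted < 80.0:
        # Ratio is preserved but FEV1 is reduced.
        return "PRISm"
    # Preserved ratio and FEV1 at or above 80% predicted.
    return "Control"


if __name__ == "__main__":
    for ratio, pct in [(0.65, 55.0), (0.75, 70.0), (0.75, 95.0)]:
        print(ratio, pct, classify_spirometry(ratio, pct))
```

The ordering of the checks matters: the ratio is tested first, so a subject with both a low ratio and a low FEV1 is classified as a COPD case rather than PRISm.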

Microbial detection

Starting from the whole blood RNA-seq data, we used reads that were not mapped to the human genome during the gene expression analysis to detect a bacterial signature. Additional filtering of the unmapped reads was performed using the PathSeq microbial detection pipeline from the Genome Analysis Toolkit (GATK4) and the host reference available from the GATK Resource Bundle41. This filtering addresses any remaining quality, host contamination or repetitive sequence issues. We subsequently used PathSeq to map these cleaned reads to bacterial genomes. The bacterial reference for mapping was created using representative genomes, chromosomes, contigs and scaffolds (277,422 total genomic entries; September 25, 2019) from the National Center for Biotechnology Information (NCBI), and the PathSeq reference creation tools. Taxonomy information for these bacterial genomic data was also obtained from NCBI (RefSeq-release95.catalog.gz). Using these mapping results and taxonomy data, the inferred bacterial abundance profiles in each sample were assembled using PathSeq. Included in these profiling data were the raw read counts, adjusted scores and normalized scores (compositional data from the adjusted scores that represent inferred relative abundance) for taxa within each taxonomic classification (genera and phyla). We used the TMM (trimmed mean of M values) method in the R/Bioconductor package edgeR42 and the RNA-seq gene expression counts from the primary analysis to normalize the PathSeq count data across samples.
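To make the notion of compositional normalized scores concrete, the sketch below rescales per-taxon adjusted scores within one sample so that they sum to 100. This is an illustration of a relative-abundance calculation under the assumption that normalized scores are percentages of the total score at a given taxonomic level; it is not PathSeq code, and the function name and example values are hypothetical.

```python
def to_relative_abundance(adjusted_scores: dict) -> dict:
    """Rescale one sample's adjusted scores to percentages summing to 100.

    adjusted_scores: mapping of taxon name -> adjusted score at a single
    taxonomic level (for example, all phyla detected in one sample).
    """
    total = sum(adjusted_scores.values())
    if total == 0:
        # No bacterial signal detected: report zero for every taxon.
        return {taxon: 0.0 for taxon in adjusted_scores}
    return {taxon: 100.0 * score / total
            for taxon, score in adjusted_scores.items()}


if __name__ == "__main__":
    # Hypothetical scores for one sample, for illustration only.
    sample = {"Proteobacteria": 600.0, "Actinobacteria": 250.0,
              "Firmicutes": 100.0, "Bacteroidetes": 50.0}
    print(to_relative_abundance(sample))
```

Because the values within a sample are constrained to a fixed sum, they are compositional, which is one reason a separate across-sample normalization of the counts is applied before association testing.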

Taxa associations

We tested associations between the TMM-normalized abundance for each taxon and host variables using linear models with the R/Bioconductor package MaAsLin2 (Multivariate Association with Linear Models)43. The abundance values were log-transformed prior to testing. With relatively low levels of bacterial genetic content in peripheral blood, the data are inherently sparse and MaAsLin2 is particularly well suited for analysis of such microbial data. The base statistical model included the covariates age, sex, race, pack-years of smoking, smoking status (current vs. former), RNA-seq library preparation batch and study center. Using the results from our primary and replication analyses, we performed a meta-analysis by combining the p-values from these tests using Stouffer’s method via the sumz function from R package metap44. The directions of effect in both the primary and replication analyses were required to be the same for the p-values to be combined. For each of the models, adjustment of the combined p-values for multiple testing controlled for false discovery rate (FDR < 5%). The heatmaps of taxa associations were produced using the labeledHeatmap function from the R package WGCNA45.
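The meta-analysis logic described above (concordant direction of effect, unweighted sum-of-z combination, FDR adjustment) can be sketched in a few lines. This is an illustrative Python re-expression, not the metap code used in the study. Several choices are assumptions: the function names, the clamping bounds on the p-values, passing the p-values to the sum-of-z step as reported, and the use of the Benjamini-Hochberg procedure (the text states only that FDR was controlled).

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def stouffer_concordant(p_primary, beta_primary, p_replication, beta_replication):
    """Combine two p-values with an unweighted sum-of-z (Stouffer) statistic.

    Returns 1.0 when the effect directions disagree, mirroring the rule that
    discordant results are not combined.
    """
    if beta_primary * beta_replication <= 0:
        return 1.0
    z_scores = []
    for p in (p_primary, p_replication):
        # Clamp to the open interval (0, 1) so the inverse CDF is defined.
        p = min(max(p, 1e-300), 1.0 - 1e-15)
        # Upper-tail z-score: small p-values give large positive z.
        z_scores.append(-_STD_NORMAL.inv_cdf(p))
    z_combined = sum(z_scores) / len(z_scores) ** 0.5
    return _STD_NORMAL.cdf(-z_combined)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    # Hypothetical (p, effect) pairs for three taxon-variable tests.
    tests = [(0.05, 0.3, 0.05, 0.2), (0.01, -0.4, 0.20, 0.1), (0.30, 0.1, 0.40, 0.2)]
    combined = [stouffer_concordant(*t) for t in tests]
    print(combined)
    print(benjamini_hochberg(combined))
```

Setting discordant results to a combined p-value of 1 keeps them in the multiple-testing adjustment while ensuring they cannot be reported as significant.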

Contamination assessment

Nucleic acids from sources other than the peripheral blood of the study subjects could impact the analyses and potentially create a false taxonomic signature. Extraction, amplification and library-preparation kits may contain nucleic acids from water and soil bacteria46. Removing taxa with inferred abundances below a specified threshold was the first step in the process of addressing contamination47. Recent studies have shown that external contaminants more consistently correlate negatively with sample nucleic acid concentration48,49. Therefore, we sought to identify additional contamination by testing the Pearson correlation between taxa abundance and RNA concentration, with a correlation coefficient < -0.4 and p-value < 0.05 demonstrating the conditions for possible contamination47. We also examined the inferred taxa abundances across the processing batches and study centers to identify patterns suggestive of contaminant introduction through laboratory kit reagents. This study did not focus on diversity measures or detection of novel organisms, as these are areas where microbial contamination may be expected to have a greater impact. In addition, our analyses involved testing associations between host binary and quantitative characteristics and microbial taxa abundance. This helps reduce the impact of batch-specific or study-wide contamination, as correlations with host variables are not expected to be consistent and significant. Our meta-analysis in two independent sets of data mitigates the effects of contamination and enhances the ability to detect biologically relevant signatures.
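The correlation-based contamination screen can be illustrated as follows. This is a sketch under stated assumptions, not the study's code: the function names are hypothetical, and the p-value is obtained from the Fisher z-transformation (a large-sample normal approximation) because the text does not specify the test used. Only the cutoffs (correlation coefficient < -0.4 and p < 0.05) come from the text.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # A constant variable carries no correlation signal.
    return sxy / math.sqrt(sxx * syy)


def approx_two_sided_p(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transformation."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * NormalDist().cdf(-abs(z))


def is_possible_contaminant(abundance, rna_concentration, r_cut=-0.4, p_cut=0.05):
    """Flag a taxon whose abundance falls as sample RNA concentration rises."""
    r = pearson_r(abundance, rna_concentration)
    p = approx_two_sided_p(r, len(abundance))
    return r < r_cut and p < p_cut


if __name__ == "__main__":
    # Hypothetical values for eight samples, for illustration only.
    concentration = [1, 2, 3, 4, 5, 6, 7, 8]
    print(is_possible_contaminant([9, 7, 8, 5, 4, 3, 2, 1], concentration))
    print(is_possible_contaminant([5, 3, 6, 4, 5, 3, 6, 4], concentration))
```

The rationale for the negative sign is that a fixed amount of contaminating nucleic acid makes up a larger share of the reads when less sample material is present.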

Host microbe interactions

We projected the human gene expression data onto the pathways in the Hallmark gene set collection using gene set variation analysis via the R/Bioconductor package GSVA50. The genes represented in both the gene expression data and the Hallmark gene sets were included in the GSVA procedure (Methods in the online supplement). The Hallmark canonical pathway set reduces redundancy found in public gene sets to enhance enrichment analyses. GSVA output is a pathway-by-subject matrix of expression data for observation of host-microbiome interactions. We used the pathways in this matrix as variables in MaAsLin2 models. Similar to the taxa-association analysis, we performed a meta-analysis by combining the p-values from these tests using Stouffer’s method44. The directions of effect in both the primary and replication analyses were required to be the same. We constructed a bipartite network (edges connecting taxa and pathways) using the results from these models. Communities within this network were identified using the R package CONDOR51. Networks and communities were visualized using the R package igraph52, with the GEM (graph embedder) force-directed layout algorithm.

Ethics statement

All subjects in this study provided written consent for study procedures, including genetic analysis. The study was approved at all clinical centers by the following Institutional Review Boards: National Jewish IRB, Partners Human Research Committee, Institutional Review Board for Baylor College of Medicine and Affiliated Hospitals, Columbia University Medical Center IRB, The Duke University Health System Institutional Review Board for Clinical Investigations (DUHS IRB), Johns Hopkins Medicine Institutional Review Boards (JHM IRB), The John F. Wolf MD Human Subjects Committee of Harbor-UCLA Medical Center, Morehouse School of Medicine Institutional Review Board, Temple University Office for Human Subjects Protections Institutional Review Board, The University of Alabama at Birmingham Institutional Review Board for Human Use, University of California San Diego Human Research Protections Program, The University of Iowa Human Subjects Office, VA Ann Arbor Healthcare System IRB, University of Minnesota Research Subjects’ Protection Programs (RSPP), University of Pittsburgh Institutional Review Board, UT Health Science Center San Antonio Institutional Review Board, Health Partners Research Foundation Institutional Review Board, Medical School Institutional Review Board (IRBMED), Minneapolis VAMC IRB, and Institutional Review Board/Research Review Committee Saint Vincent Hospital – Fallon Clinic – Fallon Community Health Plan. The research methods were carried out in accordance with the relevant guidelines.

Ethics approval and consent to participate

All subjects in this study provided written informed consent. COPDGene was approved by the Institutional Review Boards at all participating centers.

| Clinical center | Institution title | Protocol number |
| --- | --- | --- |
| National Jewish Health | National Jewish IRB | HS-1883a |
| Brigham and Women’s Hospital | Partners Human Research Committee | 2007-P-000554/2; BWH |
| Baylor College of Medicine | Institutional Review Board for Baylor College of Medicine and Affiliated Hospitals | H-22209 |
| Michael E. DeBakey VAMC | Institutional Review Board for Baylor College of Medicine and Affiliated Hospitals | H-22202 |
| Columbia University Medical Center | Columbia University Medical Center IRB | IRB-AAAC9324 |
| Duke University Medical Center | The Duke University Health System Institutional Review Board for Clinical Investigations (DUHS IRB) | Pro00004464 |
| Johns Hopkins University | Johns Hopkins Medicine Institutional Review Boards (JHM IRB) | NA_00011524 |
| Los Angeles Biomedical Research Institute | The John F. Wolf, MD Human Subjects Committee of Harbor-UCLA Medical Center | 12756–01 |
| Morehouse School of Medicine | Morehouse School of Medicine Institutional Review Board | 07–1029 |
| Temple University | Temple University Office for Human Subjects Protections Institutional Review Board | 11369 |
| University of Alabama at Birmingham | The University of Alabama at Birmingham Institutional Review Board for Human Use | FO70712014 |
| University of California, San Diego | University of California, San Diego Human Research Protections Program | 070876 |
| University of Iowa | The University of Iowa Human Subjects Office | 200710717 |
| Ann Arbor VA | VA Ann Arbor Healthcare System IRB | PCC 2008–110732 |
| University of Minnesota | University of Minnesota Research Subjects’ Protection Programs (RSPP) | 0801M24949 |
| University of Pittsburgh | University of Pittsburgh Institutional Review Board | PRO07120059 |
| University of Texas Health Sciences Center at San Antonio | UT Health Science Center San Antonio Institutional Review Board | HSC20070644H |
| Health Partners Research Foundation | Health Partners Research Foundation Institutional Review Board | 07–127 |
| University of Michigan | Medical School Institutional Review Board (IRBMED) | HUM00014973 |
| Minneapolis VA Medical Center | Minneapolis VAMC IRB | 4128-A |
| Fallon Clinic | Institutional Review Board/Research Review Committee Saint Vincent Hospital – Fallon Clinic – Fallon Community Health Plan | 1143 |

Consent for publication

Not applicable.

Results

After quality control procedures, RNA-seq data were available for 2,647 samples from current and former smokers from the COPDGene five-year follow-up visit. Approximately two-thirds of subjects were former smokers and twenty-five percent were African American (Table 1). There were slightly more males than females and the average age of these subjects was 65.5 years. The overall disease burden in the population is summarized in Table 1 by a comorbidity index (range 0 to 14, mean = 2.97 and standard deviation = 1.98)53. We performed microbial signature profiling using PathSeq and excluded 57 samples with outlying unmapped read counts (Methods in the online supplement). We then visualized the inferred relative abundance profiles and tested host associations for these 2,590 subjects (Fig. 1). Ordered by mean normalized score from PathSeq, the four taxa observed at the phylum level above an abundance-filtering 1% threshold across all subjects were Proteobacteria, Actinobacteria, Firmicutes, and Bacteroidetes. In the abundance plot of the normalized scores for these four phyla, ordered by RNA-seq library batch and study center, we observed consistent taxon distributions across the batches and study centers (Figures S1-S2 in the online supplement). Twenty genera had mean normalized scores that exceeded the 1% threshold chosen to remove low-level contamination. We observed batch-specific contamination profiles (Figures S3-S10) for eight genera (Flavobacterium, Pseudomonas, Methylobacterium, Methyloversatilis, Streptomyces, Methylorubrum, Ralstonia and Nevskia). All of these genera are known possible contaminants24,46 and were excluded from the analyses. We also sought to identify remaining contamination by observing the relationship between inferred abundance and nucleic acid concentration using the computational approach outlined in Methods. We again identified the aforementioned genus Methyloversatilis (correlation coefficient = -0.44 and p < 0.0001) as a possible contaminant.
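The 1% abundance filter described above can be illustrated with a short sketch. It assumes that normalized scores are expressed as percentages and that the filter is applied to the mean score across all samples; the function name and the toy values are hypothetical and do not reproduce study data.

```python
from statistics import mean


def taxa_above_threshold(normalized_scores, threshold_pct=1.0):
    """Return taxa whose mean normalized score exceeds the threshold.

    normalized_scores: mapping of taxon -> list of per-sample normalized
    scores (percent). Taxa are returned ordered by decreasing mean score.
    """
    means = {taxon: mean(values) for taxon, values in normalized_scores.items()}
    kept = [taxon for taxon, m in means.items() if m > threshold_pct]
    return sorted(kept, key=lambda taxon: means[taxon], reverse=True)


if __name__ == "__main__":
    # Hypothetical per-sample scores for three taxa across three samples.
    scores = {
        "Firmicutes": [10.0, 12.0, 8.0],
        "Proteobacteria": [60.0, 55.0, 65.0],
        "RareTaxon": [0.2, 0.5, 0.1],
    }
    print(taxa_above_threshold(scores))
```

Filtering on the mean across samples, rather than per sample, retains a taxon that is occasionally absent while removing taxa that are consistently near the detection floor.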

Table 1. COPDGene study subjects.

Figure 1. Overview of the study design illustrating the sequencing, statistical and gene enrichment framework. This illustrates the integration with host characteristics and gene expression for observations of host microbiome interaction (GATK = Genome Analysis Toolkit; MaAsLin2 = Multivariate Association with Linear Models).

Genera abundance and host phenotype

We normalized the taxa counts at the genus level from PathSeq using the TMM method. We created a summary of the reads from the gene expression and PathSeq analyses for each of the 12 taxa (Table S1 in the online supplement). Using the TMM-normalized taxa abundances, we created a heatmap with clustering of samples in the columns by Bray–Curtis dissimilarity (Figure S11 in the online supplement). In the color coded tracks for BMI, race, sex, library preparation batch, study center, COPD status and smoking status, we observed visual clustering only by batch (grouping of samples from the same batch). A variable for library batch was included as a covariate in the statistical models to mitigate batch effects and reduce spurious findings. We tested associations between the TMM-normalized abundances for each taxon at the genus level and host phenotype, exposure, treatment and trait variables using linear models with MaAsLin2 (Table S2 in the online supplement). We summarized the findings in a heatmap of the p-values and effect sizes (Figure S12 in the online supplement).
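For reference, the Bray–Curtis dissimilarity used to cluster samples is the sum of absolute abundance differences divided by the sum of all abundances in the two samples. The sketch below is a plain implementation of that formula; it assumes non-negative abundance values, and the function name and example vectors are hypothetical.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors.

    Both vectors must list the same taxa in the same order. The result is 0
    for identical profiles and 1 for profiles with no shared taxa.
    """
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(a + b for a, b in zip(sample_a, sample_b))
    if denominator == 0:
        return 0.0  # Two empty profiles are treated as identical.
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical abundances for three taxa in two samples.
    print(bray_curtis([6, 7, 4], [10, 0, 6]))
```

A pairwise matrix of these values across all samples is what a hierarchical clustering step would take as input.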

Using an independent replication set of 1,065 samples from the COPDGene five-year follow-up visit (Table S3 in the online supplement), we detected microbial signatures using PathSeq and normalized the taxa counts at the genus level using the TMM method (Table S1 in the online supplement). Contamination was not observed in these data for the 12 taxa using the same methods as in the initial dataset. We performed association tests using the models and methods from the primary analysis and the TMM-normalized taxa abundances for the 12 taxa in the replication set. We summarized the findings in a heatmap of the p-values and effect sizes (Figure S13 in the online supplement).

Meta-analysis

The p-values from the primary and replication analyses were combined for each of the association tests using Stouffer’s method, requiring that the directions of effect be the same. A heatmap was created to summarize the meta-analysis results (Fig. 2, Figure S14 in the online supplement), with the color intensity indicating significance (negative log-transformed q-values) and gray or blue shading indicating the effect direction. Scatter or box plots of the model residuals of the inferred TMM abundance for the significant (FDR < 5%) meta-analysis findings were created in the primary (Figure S15 in the online supplement) and replication (Figure S16 in the online supplement) sets of data to illustrate the relationships between taxa abundance and the variables of interest.

Figure 2. Heatmap of the associations between genera inferred abundance and host-related variables for the meta-analysis. Variables with at least one finding with FDR < 10% were included. The value in each cell is the adjusted q-value. The color scale for the cells represents the sign of the effect multiplied by negative log10 of the q-values, with intensity proportional to significance and gray shading representing positively correlated associations and blue shading representing negatively correlated associations. Results with discordant directions of effect in the meta-analysis are set to q = 1 (white) (heatmap produced using the labeledHeatmap function from the R package WGCNA45). Variables with at least one significant association are included (WBC = white blood cell count, Lymphocytes = lymphocyte count, NeutroLymph_Ratio = ratio of neutrophil counts to lymphocyte counts, Lymphocyte_pct = percentage of lymphocytes, Neutrophil_pct = percentage of neutrophils, 6 MW = six-minute walk distance, mMRC = Modified Medical Research Council dyspnea score, COPD (case–control) = COPD cases vs. controls, PackYears = pack-years history of smoking, Smoking (current-former) = current vs. former smoking status).

From the meta-analysis (Fig. 2), we observed associations between smoking status (current vs. former) and Acinetobacter (q = 0.017), Serratia (q = 0.0057) and Cutibacterium (q = 0.017) abundance. Two measures of functional capacity (6-min walk distance and mMRC dyspnea scale) were associated with at least one taxon. Acinetobacter (q = 0.042), Serratia (q = 0.0093), Streptococcus (q = 0.042) and Bacillus (q = 0.048) abundances were associated with mMRC, with a higher dyspnea score corresponding to higher bacterial abundance. Serratia (q = 0.042) abundance was associated with 6-min walk distance (6 MW), with higher bacterial abundance corresponding to lower 6-min walk distances. All 12 taxa were associated (q < 0.05) with at least one white blood cell distribution variable. Neutrophil levels and bacterial abundance were positively correlated. Conversely, lymphocyte levels were negatively correlated with abundance. Abundance for nine of the 12 taxa was associated (q < 0.05) with sex, with lower bacterial abundance in males. Seven of the 12 taxa were associated with race, with bacterial abundance lower in non-Hispanic white participants.

Host-microbiome interactions

We sought to highlight host-microbiome interactions using microbial abundance profiles and host gene expression pathways. We created a matrix of pathway expression for the Hallmark sets from MSigDB using the R/Bioconductor package GSVA and the human blood RNA-seq data in both the primary and replication data. We tested the association between TMM-normalized taxa abundance and host pathways in both sets of data for each of the 12 genera using linear models adjusted for age, sex, race, pack-years of smoking, current smoking status (vs. former), library prep batch and study center. The associations across all taxa and pathways were summarized for both sets of data in heatmaps (Figures S17 and S18 in the online supplement). The p-values from the primary and replication analyses were combined for each of the association tests using Stouffer’s method, requiring that the directions of effect be the same, and a heatmap was created to summarize the results (Figure S19 in the online supplement). We used network methods to visualize the large set of significant findings. We constructed a bipartite network using the significant (FDR < 5%) associations as edges (edge weights = -log10(p-value)) between taxa and pathways (Figure S20 in the online supplement). Using CONDOR (see Methods), we identified three communities within this network (Figures S21 and S22 in the online supplement) with one of particular relevance to our taxa-association findings (Fig. 3). This community has six genera (Streptococcus, Cutibacterium, Corynebacterium, Lactobacillus, Staphylococcus, and Bacillus) and 15 host pathways, including WNT BETA CATENIN SIGNALING, MTORC1 SIGNALING, and OXIDATIVE PHOSPHORYLATION. Within these communities we observe clustering of genera with shared pathway associations, suggesting joint influence on the host processes.
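The construction of the bipartite edge list can be sketched as follows. This is an illustration of the edge definition stated above (significant associations at FDR < 5%, weighted by -log10 of the p-value), not the CONDOR or igraph workflow; community detection is not shown. The function names, the tuple layout and the example p- and q-values are hypothetical.

```python
import math
from collections import defaultdict


def build_bipartite_edges(associations, fdr_cut=0.05):
    """Build weighted taxon-pathway edges from meta-analysis results.

    associations: iterable of (taxon, pathway, p_value, q_value) tuples.
    Returns a dict mapping (taxon, pathway) -> edge weight, -log10(p_value).
    """
    edges = {}
    for taxon, pathway, p_value, q_value in associations:
        if q_value < fdr_cut:
            edges[(taxon, pathway)] = -math.log10(p_value)
    return edges


def shared_pathways(edges, taxon_a, taxon_b):
    """Return the pathways connected to both taxa in the edge list."""
    by_taxon = defaultdict(set)
    for taxon, pathway in edges:
        by_taxon[taxon].add(pathway)
    return by_taxon[taxon_a] & by_taxon[taxon_b]


if __name__ == "__main__":
    # Hypothetical results, for illustration only.
    results = [
        ("Streptococcus", "OXIDATIVE PHOSPHORYLATION", 1e-4, 0.004),
        ("Cutibacterium", "OXIDATIVE PHOSPHORYLATION", 1e-3, 0.020),
        ("Cutibacterium", "MTORC1 SIGNALING", 0.03, 0.200),
    ]
    edges = build_bipartite_edges(results)
    print(edges)
    print(shared_pathways(edges, "Streptococcus", "Cutibacterium"))
```

Because edges only ever connect a taxon to a pathway, two genera are linked indirectly through the pathways they share, which is the structure that community detection then groups.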

Figure 3. Community from the bipartite network from the host-microbiome interaction analysis with relevance to COPD, dyspnea and smoking associations. Edges represent a significant (FDR < 5%) association between genus abundance (blue circles) and the expression of the human Hallmark pathway (red squares) in the meta-analysis (figure produced using the R package igraph52).

Discussion

We re-purposed peripheral blood RNA-sequencing data in a large sample set from the COPDGene Study. Using RNA-sequencing reads that did not map to the human genome, we identified microbial signatures at both the phylum and genus levels. We tested associations between inferred abundance and host-related variables. At the phylum level, we identified Proteobacteria, Actinobacteria, Firmicutes, and Bacteroidetes. Recent studies using both 16S rRNA gene sequencing and unmapped human RNA-seq data have shown that peripheral blood typically includes a nucleic acid signature of these phyla26,27,35.

Taxa associations

Detection at the genus level produced a larger set of taxa, with all 12 taxa significantly associated with at least one host-related variable. Eight of the genera had at least six significant findings. For the associations between taxa abundance and white blood cell composition, we observed positive correlation for neutrophil percentage and neutrophil-to-lymphocyte ratio. Although the role of neutrophils in the establishment of the microbiome can be complex54, the positive correlation is plausible given the role of neutrophils in the defense against bacterial infections. We observed a positive association between the genera Acinetobacter and Streptococcus and mMRC dyspnea score. Acinetobacter is a known cause of acute exacerbations and lung infections55,56,57. Acinetobacter airway abundance may also be a marker of outcome for critically ill COPD patients58. Streptococcus pneumoniae is a common cause of respiratory infections and has been observed in the airway of patients with exacerbations57. It has also been isolated from sputum samples of COPD patients in both stable and exacerbation states59. The abundance of Serratia and Bacillus was also associated with mMRC dyspnea score. Although Serratia and Bacillus species are less frequently associated with lung infections, Serratia has been identified in patients with exacerbations of COPD60,61. Bacillus was isolated from the lung of stable COPD subjects62 and subjects with more variable microbiomes during a longitudinal study of sputum in COPD63. In the study by Bouquet et al.63, microbiota variability corresponded to higher exacerbation frequency and frequent viral infections in stable COPD. The association between Serratia abundance and six-minute walk distance highlights another association with relevance to pulmonary functional capacity and outcomes in COPD64.

Acinetobacter, Serratia and Cutibacterium abundance was associated with current smoking status, compared to former smokers. Species in the genera Acinetobacter and Serratia have been identified in cigarettes65, providing a possible mechanism for introduction of these taxa, though an explanation for higher abundance in the peripheral blood of former smokers is not apparent at this time. Community-acquired Acinetobacter infections, including bacteremia, were also found more often in patients with a history of heavy smoking66. Cutibacterium species are members of the upper respiratory tract microbiome67,68, and although smoking has an impact on the microbiome of the upper respiratory tract69,70, evidence regarding the influence of smoking on Cutibacterium is lacking. Irrespective of individual taxa, the impact of smoking on bacterial infections and the microbiome is complex71,72, particularly in the context of COPD73,74. Together, this information suggests relevance for the identified taxa in the lung microbiome and respiratory infections with possible implications in chronic or persistent dyspnea and inflammation.

Further efforts will be required to determine whether these associations in peripheral blood highlight cross-tissue mechanisms similar to the immunomodulatory effects observed in the gut-lung axis75,76, or perhaps similar to interactions or microbial translocations observed between liver and gut in liver disease77. Despite any direction of effect ambiguity, together these findings suggest we may be capturing lung disease relevant microbial signatures in peripheral blood.

The associations between nine taxa and sex are supported by previous findings regarding sex-specific microbiome characteristics in the gut78,79. Previous studies highlighted sex differences with respect to bacterial infections, including respiratory infections80, and relevance in the relationship between airway microbiome and asthma81. Likewise, gut microbiota diversity may vary across ethnicity82,83, supporting our taxa abundance associations with race. The associations between blood taxa and both sex and race may provide insight into systemic host bacterial responses and inform development of personalized therapeutics.

Host-microbiome networks

We leveraged the human RNA-seq data from the same samples to explore host-microbiome interactions using network methods for significant taxa and host pathway associations. Within the communities of the bipartite network, genera with common pathway associations were clustered, providing insight into shared influence on the host processes. For one particular community within the bipartite network (Fig. 3), we observed clustering of Streptococcus (associated with mMRC dyspnea score) with Cutibacterium (associated with current smoking status) through several host pathways, including OXIDATIVE PHOSPHORYLATION, WNT BETA CATENIN SIGNALING, and MTORC1 SIGNALING. Pathways in Fig. 3 are involved in aspects of COPD. With regard to oxidative phosphorylation, mitochondrial reactive oxygen species production and mitochondrial dysfunction are believed to have a role in the development of lung diseases including COPD84, with implications for exercise capacity85. It has been suggested that cross-talk between the bacterial microbiome and mitochondria is a component of overall microbiome interactions with the host86.

The mTORC1 signaling pathway has been implicated in lung cell senescence and emphysema87 and is involved in airway inflammation88 and development of corticosteroid resistance driven by cigarette smoke89. Having a prominent role in regulation of immune responses90,91, the mTOR pathway, in particular, responds to environmental changes and regulates intracellular processes92. The mTOR pathway may have a role in determining the composition of the gut microbiome93,94.

Airway down-regulation of the Wnt/beta-catenin pathway has been observed in smokers95, suggesting a role in the development of smoking-related airway disease and airway inflammation in COPD96. With a role in cell proliferation and cellular morphology97, Wnt/beta-catenin signaling is a process bacterial pathogens may exploit to better establish infection98, providing a possible target for future antimicrobial therapeutics99. Although the microbial signature observed in our study does not appear to be pathogen-specific, both the establishment and maintenance of the bacterial microbiome and the regulation of a host pathogen defense involve a shared complex relationship with host immune responses100.

Together, these findings suggest we have detected a systemic blood signature of host-microbe interactions with pathogenic relevance and perhaps linked to the COPD-relevant associations we identified. This bipartite network approach demonstrates a versatile method for observation of these host-microbiome interactions. The approach is similar to previous airway host-microbiome interaction studies, though focused on a knowledge-based pathway approach instead of unsupervised dimensionality reduction of gene expression data using principal component analysis (PCA)21. The edges in this network may highlight taxa with shared interactions or influence on host biological processes. Both the blood microbial signatures and the structure of these host interactions may inform patient stratification or personalized medicine efforts related to COPD and exacerbations. These efforts could involve particular host pathway or gene targets, identified by their relationship to COPD-relevant microbial taxa using these methods.

Limitations

There are several limitations to the current study. In this secondary analysis of blood RNA-sequencing data, we are capturing RNA from bacterial genes. These mapped reads are serving as a proxy for abundance. Future studies involving 16S rRNA gene or whole-genome shotgun sequencing in parallel with the host transcriptome analysis will provide further insight into the blood microbial signatures. Although the existence of a healthy blood microbiome remains a subject of debate35, we have replicated taxa from previous blood microbiome 16S rRNA gene and RNA sequencing studies26,27,30, demonstrating the generalizability of this approach. Future metagenomic studies with concurrent blood and lung or airway samples, perhaps in a longitudinal context, will be required to determine to what extent peripheral blood recapitulates the lung microbiome. This may also reveal mechanisms responsible for overlapping microbial signatures, such as bacterial translocation, and further identify any transient behavior of these signatures. Given the relatively small effect sizes, the applicability of these findings in a clinical context will be considered in future studies. The sequencing data from this study were not obtained for use in a microbiome study. Therefore, specific bacterial contamination mitigation procedures were not included in the COPDGene protocol, beyond sterile blood acquisition. We assessed for contamination using visual inspection of our data and statistical testing, and we excluded taxa with any potential evidence of contamination. A replication dataset was included to ensure validity of our results. In future studies, protocols involving the inclusion of negative controls and treatment of kit reagents to reduce contaminating nucleic acid content and other measures will help to address the issue of sample contamination24,46.

Conclusions

In this study of the blood microbiome, we were able to identify COPD-relevant bacterial signatures in a secondary analysis of peripheral blood RNA-seq data from a large cohort of smokers. Analyses at the genus level found associations between blood microbial signals and multiple COPD-relevant traits. Using a network approach on the paired human RNA-seq and microbial datasets, we identified host transcriptomic pathways linking multiple taxa, highlighting a useful method for future studies of the human microbiome and transcriptome. Together these findings demonstrate that the peripheral blood microbial signature and host-microbiome interactions may have the potential to capture relevant lung microbiome features and biology. This study provides an initial step toward discovery of composite blood biomarkers for use in predictive disease models to inform personalized treatments of chronic smoking-related diseases.