Background

Recent microbiome-wide association studies (MWASs) have uncovered that microbiome plays a crucial role in human health and disease [1,2,3,4], with linkage of microbiota dysbiosis to a variety of complex diseases, including diabetes, cardiovascular and mental disease, and cancer [5,6,7,8,9,10,11,12]. These studies provide great opportunities to study microbiome’s role in disease prediction, which, however, is challenging because of its unique data structure.

Rapid advances in high-throughput sequencing technologies identify diverse microorganisms in a single sample by targeted sequencing of their unique 16S rRNA gene, or shotgun sequencing of the collective genomes of all microbes. For 16S rRNA sequencing data, QIIME 2 [13] is commonly used to assign the sequencing reads to amplicon sequence variants or clustered operational taxonomic units based on the similarity of sequences. For shotgun sequencing data, MetaPhlAn [14] or StrainPhlAn [15] can be used to map the sequencing reads to species/strains against a reduced set of clade-specific marker sequences. Either method produces the count or relative abundance table which typically contain hundreds to thousands of taxonomic or functional features, i.e. microbiome data are high-dimensional, especially compared to the available number of samples in most existing studies. In addition, these feature tables are usually sparse with excessive zero counts, compositional with a sum constrained to a constant, and heterogeneous with a phylogenetic tree structure to reveal the evolutionary relationship among the taxa. How to deal with these unique characteristics of microbiome data and effectively utilize them in predicting disease risk is challenging and needs comprehensive explorations and validations.

Polygenic risk score (PRS), a continuous score of an individual's genetic liability to a complex disease or phenotype, has become more routine and powerful in current genomic research [16, 17]. PRS aggregates the results from genome-wide association studies (GWASs) and is defined as the sum of risk alleles linked to a phenotype of interest weighted by the corresponding effect sizes. The construction of PRS involves two key steps: determining the risk alleles and their effect sizes based on discovery samples or published GWASs, and calculating the PRS for each subject in the target population. The PRS framework motivates us to construct a similar microbial risk score (MRS) to summarize the disease-specific microbial profile in the increasing large-scale population-based microbiome studies [11, 18, 19] and to investigate its potential in disease prediction. However, microbiome’s unique community features make MWASs differ from GWASs. First, the microbiota is a complex ecosystem, whose dynamics are driven by the interactions among microbes and between microbes and their host. The link between this complex ecosystem and disease process often involves interwoven mechanisms [20]. Further, the microbiota is composed of various sub-communities related to different traits [21, 22], and its influence on disease development may act at the community rather than the single-microbe level [23]. Thus, it is less informative or efficient to simply define MRS as the weighted sum of the relative abundances of the associated microbes. Instead, we propose a community-based MRS by calculating alpha diversity on a sub-community with member taxa identified as being associated with the study trait. Alpha diversity is the diversity in a single ecosystem or sample with respect to its richness, evenness, or both characteristics [24, 25]. Several indices, including Observed, Simpson, Shannon, and Faith's phylogenetic diversity (PD), have been extensively used to characterize microbial community. With the NYU Langone Health (NYULH) COVID-19 cohort [26] and the gut microbiome health index (GMHI) multi-study cohort [27], we propose and validate a few MRSs on COVID-19, colorectal adenoma (CA), colorectal cancer (CC), Crohn’s disease (CD), and rheumatoid arthritis (RA) to exhibit the utility of the proposed MRS framework.

With the recent advances in the next-generation sequencing and mass spectrometry, there is a growing need for the ability to merge biological features to study an ecosystem as a whole. Aspects such as the metagenome, metatranscriptome, host genome, host gene expression, and metabolome provide a snapshot of one level of regulation in a system. The proposed MRS framework provides a simple and interpretable approach to integrate the microbial profiles with other biological omics data and elucidate the microbial interactions with other omics datasets in the disease prediction. We use the NYULH COVID-19 cohort, which characterized the lung microbiome in a large prospective cohort of critically ill patients with SARS-CoV-2 infection who required invasive mechanical ventilation, to illustrate, evaluate, and validate the proposed MRS and its integrations with other omics data in the prediction for COVID-19 mortality. In addition, we elucidate the join effect of MRS and PRS on T1D risk stratification using the Environmental Determinants of Diabetes in the Young (TEDDY) study (https://teddy.epi.usf.edu/) [28,29,30].

Methods

MRS framework

MRS workflow

We propose a microbial risk score framework to convert the high-dimensional microbiome data into a summarized risk score that can measure and predict disease susceptibility. As illustrated in Fig. 1, with the ready-for-downstream-analysis microbial data, the microbial risk score algorithm involves two key steps: (1) to identify a sub-community consisting of the signature microbial taxa associated with disease and (2) to integrate the identified microbial taxa into a continuous score.

Fig. 1
figure 1

The workflow of the microbial risk score (MRS) framework. Data input: a phyloseq-class object is needed, which consists of a feature table (observed count table), a sample metadata, a taxonomy table (optional), and a phylogenetic tree (optional). MRS Algorithm has two steps: Step 1 is to identify a sub-community consisting of the signature microbial taxa with the P+T method and AUC evaluation in the discovery cohort. The black ROC curve which has the largest AUC determines the optimal p value cutoff. Step 2 is to integrate the identified microbial taxa into a continuous score, i.e., calculate the MRS value for each sample by calculating the diversity of the identified sub-community with the Shannon index. In addition, the constructed MRS is independently validated in the validation cohort. Application: In this manuscript, we perform multi-omics data integration for disease prediction by jointly modeling the proposed MRS and other risk scores constructed from other omics data in two real data cohorts

Microbial signature identification

We propose to employ the existing sophisticated microbial association tests [1, 3, 4, 31,32,33] to identify microbial taxa associated with disease using the discovery samples. Great amount of abundance-based methods examining the difference of microbial abundance directly, which is also called differential abundance (DA) analysis [31,32,33,34,35,36,37,38,39] have been proposed recently. Based on the results in two recent benchmarking works [32, 33], ANCOM-BC (analysis of compositions of microbiomes with bias correction) [31] is one of the top-performing methods and has been widely used in microbiome research. ANCOM-BC [31] models the observed abundances using an offset-based log-linear model, in which the offset term is sample-specific to account for sampling fraction. We use it as the default microbial association test to identify the candidate taxa in the first step of our microbial risk score algorithm. Considering developing novel differential abundance test is still an active area of research, in the Discussion section, we discuss the performance of the proposed MRS framework with other DA tests.

In addition to the above-mentioned statistical methods, a variety of machine learning (ML) techniques have been applied in microbiome studies for microbial feature selection, biomarker identification, disease prediction, and classification, as recently reviewed [40]. As an example, Gou et al. [41] defined an MRS with the microbiome features selected by the light gradient boosting machine method [42] and examined its association with type 2 diabetes (T2D) as well as T2D-related traits. Despite the visible contributions of characterizing the microbial profiles and uncovering the relationship between microbiome and disease, the applications of ML methods including traditional methods and deep learning techniques in the microbiome studies share several drawbacks [40]. One is that most ML methods input all available microbial features into the model to determine the final output solely based on algorithms, without considering the inherent structure of microbiome data, such as compositionality and zero inflation. Another unavoidable drawback of ML methods is the model instability in the relatively small-scale biomedical human studies [43]. Because the nature of ML algorithms is to learn the pattern by training the data, they usually require a large sample size to reach stable results, especially for the algorithms involving various parameters or various layers that need to be trained via cross-validation (CV). Given these common pitfalls and relatively small sample size in biomedical studies due to the high cost of patients’ in-person visit, sample collection, and sequencing, ML’s application in microbiome research may provide inexplicable results and even lead to the loss of statistical power. With the NYULH COVID-19 cohort example, we illustrate the inefficient utility of ML methods in analyzing the microbiome data compared to the proposed MRS method. The details are reported in the Results section.

Sub-community determination

Pruning and thresholding (P+T) method is a heuristic approach commonly used in PRS studies for identification of genetic variants based on an empirically determined p value threshold [44]. We propose to use the P+T method to determine the final candidate microbial taxa in discovery cohort. Specifically, we calculated a series of MRSs proposed below using the nested sets of microbial taxa with the increasingly relaxed significance thresholds. We set the final threshold at the value that produced the largest area under the receiver operating characteristic (ROC) curve (AUC). All the taxa whose p values are less than the final threshold form a disease or trait-specific sub-community. If there is only one dataset available, CV will be used to determine the sub-community along with the P+T method. More details are provided in the Results section.

MRS calculation

We propose an MRS, denoted by MRSα, which is defined as the alpha diversity of the sub-community consisting of the identified candidate taxa. Alpha diversity is the diversity in a single ecosystem or sample with respect to its richness, evenness, or both characteristics [24, 25]. The core concept of alpha diversity index in biology is to find the effective number of elements of a system to measure its complexity or diversity [45]. Note that multiple alpha diversity indices are available. Some measure species richness such as observed index, Chao1, and ACE. PD is a phylogenetic metric which is defined as the sum of the lengths of all those branches on the tree that span the members of the set. Simpson index is a dominance index as it gives more weight to the common or dominant species and does not account for species richness. While Shannon index is an information statistic index (entropy) which accounts for both species richness and its evenness in a community or sample, and it has a unique ability to weigh taxa by their frequency, without disproportionately favoring either rare or common elements. As the most popular and accepted index for diversity [46], we adopt Shannon index in the proposed MRSα. Other indices are also investigated in the Discussion section and included in the MRS framework (MRS R package).

Suppose there are n samples (each sample represents one ecosystem or microbial community) and Q taxa. Let Mij be the relative abundance of the jth taxon in the ith sample with the constraint \(\sum_{ij=1}^Q{M}_{ij}=1\), i = 1, …, n, and j = 1, …, Q. Assume p (<Q) taxa are identified as a sub-community to construct MRSs. Without loss of generality, we assume that the first p taxa are the identified candidate taxa.

For the ith sample, its MRSα is calculated as \(\mathrm{MR}{\mathrm{S}}_{\alpha}^i=\sum_{j=1}^p{\overset{\sim }{M}}_{ij}\ln \left({\overset{\sim }{M}}_{ij}\right)\), where \({\overset{\sim }{M}}_{ij}\) is relative abundance of the jth identified candidate taxon within the sub-community for the ith sample (\({\overset{\sim }{M}}_{ij}=\frac{M_{ij}}{\sum_{j=1}^p{M}_{ij}}\)). MRSα is constructed based on the Shannon index [24, 25] without the negative sign, so that the smaller is MRSα, the healthier is the microbial community [47]. As a comparison, we also derive a standard MRS as an analogy to PRS, denoted by MRSS. It is a (weighted) sum of relative abundances of the identified candidate taxa as \(\mathrm{MR}{\mathrm{S}}_S^i=\sum_{j=1}^p{w}_j{M}_{ij}\), where wj is the weight for the jth taxon. We propose two sets of weights: all weights are equal to 1 (denoted by \(\mathrm{MR}{\mathrm{S}}_{unwS}^i\)), and the weights are the effect sizes estimated from the training or discovery samples by certain microbial association method (denoted by \(\mathrm{MR}{\mathrm{S}}_{wS}^i\)). Noticeably, MRSα integrates p identified taxa as a community by measuring its diversity, while MRSS focuses on the additive effect of the identified taxa and does not account for the microbial community feature.

Validation

The proposed MRSs need to be validated either by external validation or internal validation. Since the GMHI multi-study cohort [27] has independent discovery and validation cohorts, the MRSs are created using the discovery cohort and validated using the validation cohort. For the NYULH COVID-19 [26] and TEDDY studies [28,29,30], due to the lack of independent additional samples, we employ CV to perform independent internal validation.

Risk score-based multi-omics data integration

Note that the proposed MRS summarizes a complex microbial profile into a quantifiable score, which provides a fast and flexible way to integrate microbiome data with other omics data to better predict disease risk. Both the NYULH COVID-19 and TEDDY studies contain not only microbial profile data, but also other omics data. We propose to jointly model MRS and other risk scores built on other omics data to further improve the performance of disease prediction. In the COVID study, on the one hand, the enrichment of SARS-CoV-2 and some oral commensals in the lower-airway microbiota are associated with poor outcome, and on the other hand, host lower-airway immune phenotypes reveal a failure of adaptive and innate immune response to SARS-CoV-2 among deceased subjects. Jointly modeling these omics profiles can improve the predictive accuracy of mortality. For the TEDDY study, since that genotype data in the regions containing autoimmunity and inflammatory response genes are available, one can build a PRS for each subject using the existing PRS algorithms [48,49,50]. By combining the PRS and the proposed MRS, we can jointly model the association of genetic and environmental risk in T1D prediction.

Prediction performance evaluation

With the constructed risk scores from various omics data, one can employ a logistic regression model for the prediction of disease status (binary outcome) or a Cox proportional-hazards model [51] for the prediction of disease onset (survival outcome). Predication performance can be evaluated by AUC for binary outcome or by hazard ratio (HR) for survival outcome. The additive model can be used to integrate multiple risk scores in these two regression models. The interaction terms between scores can be explored further for risk stratification [52], as illustrated in the TEDDY study in the Results section.

NYULH COVID-19 cohort

The NYULH COVID-19 cohort [26] includes a subset of 142 patients with COVID-19, at the NYULH Manhattan campus from March 3 to June 18, 2020, who required invasive mechanical ventilation and underwent bronchoscopy for airway clearance and/or tracheostomy. Among all patients, 108 (76%) survived hospitalization and 34 (24%) deceased. The study has collected and processed lower-airway samples and performed: (a) metagenomic sequencing for bacterial, fungal, and DNA viral genomes and (b) metatranscriptome assays for viral, bacterial, fungal, and human transcriptomes and the RNA virome. In addition, comprehensive demographic, longitudinal clinical, and treatment data are available.

GMHI multi-study cohort

An integrated dataset of 4,347 human stool metagenomics samples (cross-sectional) from 34 published studies (discovery cohort) and an independent dataset of 679 samples from 9 additional studies (validation cohort) are publicly available [27]. Both cohorts consist of healthy subjects and patients with various diseases. Using these two cohorts, Gupta et al. [27] introduced and validated the gut microbiome health index (GMHI) to quantify the likelihood of disease presence based on subject’s gut microbiome data. In both cohorts, they pooled samples from different disease conditions together into one nonhealthy group, and the proposed GMHI exclusively identifies the difference of microbiome profile between healthy and non-healthy samples. After the pre-processing and quality control, there are 2,636 healthy and 1,711 nonhealthy samples in the discovery cohort and 118 healthy and 561 nonhealthy samples in the validation cohort, respectively. Among nonhealthy samples, discovery and validation cohorts both have samples from patients with CA, CC, CD, and RA. Sample sizes are shown in Table S1. For microbiome data, there are 313 species and 576 species in the discovery and validation cohorts, respectively, available for analysis. More details are described in Gupta et al. [27].

TEDDY cohort

TEDDY is a large-scale prospective study designed to identify the genetic and environmental triggers that cause childhood T1D [28,29,30]. Children with high-genetic risk for islet autoimmunity or T1D were enrolled, and multiple biomarkers were assessed longitudinally for prediction of the T1D development. A total of 12,005 fecal samples from 903 children, collected from 3 to 46 months of age, were characterized by 16S rRNA sequencing. Of this cohort, 114 children were ascertained to T1D by year 5 [29]. The findings in the previous TEDDY publications [53, 54] focus exclusively on the microbiome profiles and suggest that the gut microbiome data may have the potential to predict the progression of T1D. In addition to microbiome data and metadata, the TEDDY cohort also includes genomic, longitudinal metabolomic, and host transcriptomic data which together provide opportunity to explore the integrated information from multiple aspects on the pathogenesis of T1D through the multi-omics analysis.

Results

Evaluation and validation of MRS framework

NYULH COVID-19 cohort

With the same quality control, sequencing process, and filtering criteria described in Sulaiman et al. [26], we analyzed data from 118 patients (28 deceased and 118 alive) who had all metagenome, metatranscriptome, and host transcriptome samples. We included 374 taxa in metagenome, 1,149 taxa in metatranscriptome, and 14,697 genes in host transcriptome data for our analyses. We used the binary outcome (deceased vs. alive) to illustrate the predictive performance of MRS here.

Figure 2 presents the optimal p value thresholds (0.42, 0.38, and 0.02) used to identify the associated taxa in MRSs (MRSα, MRSwS, and MRSunwS, respectively) using the metagenomic data. The optimal thresholds were determined by P+T method as described in the sub-section “Sub-community determination” using the leave-one-out CV. With the optimal p value cutoffs, the community-based MRSα has the best performance in predicting deceased/alive status (AUC=0.74), compared to two summation-based standard MRSs: MRSwS (AUC=0.72) and MRSunwS (AUC=0.70). This reflects that analyzing the microbial profile as a community can characterize more microbial information and work better than analyzing microbes individually. Additionally, MRSwS performs better than MRSunwS, as expected, since MRSwS incorporates the strength of the association effects of taxa on the outcome, as well as the microbial relative abundances, while MRSunwS is just the summation of the microbial relative abundances from the selected taxa.

Fig. 2
figure 2

The optimal p value thresholds by P+T method for including taxa in MRSα, MRSwS, and MRSunwS, separately, using the metagenomic data in the NYULH COVID-19 cohort. Specifically, given a cut-off, the taxa with p values less than the cut-off were selected and defined as a sub-community. The p values were obtained by ANCOM-BC method. The leave-one-out CV was used for the predictions. MRSα: the negative alpha diversity (Shannon index) was calculated for each sample on the selected sub-community; MRSwS: the weighted sum of relative abundances of the selected taxa with the weights being the coefficients estimated from the ANCOM-BC log-linear model; MRSunwS: the sum of relative abundances of the selected taxa

Figure S1 shows the prediction performance for various ML algorithms which have been commonly applied in microbiome research [40]. The leave-one-out CV was used for the predictions, and the predicted probability for deceased/alive status was used for ROC analysis. All ML algorithms have lower AUCs than the proposed MRSα. Among these ML algorithms, the ML algorithms based on regularization (Figure S1A) all perform better with higher AUCs, compared to the ML algorithms that have various tuning parameters or layers (Figure S1B). Elastic-net logistic regression and penalized discriminant analysis (regression-based) algorithms have the best prediction performance. On the other hand, ML algorithms were also applied to select the candidate taxa used for the construction of MRSα based on the variable importance. The top K features were determined based on leave-one-out CV. Taking the elastic-net logistic regression which has the best prediction above for example, the top 30 taxa were ultimately selected to construct MRSα with the AUC being the largest based on CV, and its AUC for deceased/alive status prediction is 0.66, which is 11% lower than the AUC of the above MRSα. The efficiency of ML algorithms is evidently limited due to the small sample size and not being able to take care of the unique features of microbiome data, such as compositionality and zero inflation.

In addition, we checked the prediction performance of the alpha diversity indices on the whole microbial community in terms of AUC. Table S2 reports the AUC values for six common alpha diversity indices in predicting alive and deceased status. All alpha diversity indices have similar prediction performance, with AUC being 0.50 to 0.53, which are much poorer than the proposed MRSα. Comparisons between MRSα and alpha diversity indices underline the significance of identification of the associated taxa in the microbial risk score framework, which condenses the signal by excluding the non-associated taxa and provides full potential for the proposed MRS to measure and predict disease susceptibility.

GMHI multi-study cohort

With the discovery and validation cohorts [27], we evaluated and validated the proposed MRSα in terms of predictive performance. Specifically, for CA, CC, CD, and RA diseases, respectively, we performed ANCOM-BC to identify candidate species that were differentially abundant between samples from healthy subjects and patients with this disease in the discovery cohort, constructed disease-specific MRSα based on the identified species, and performed the independent validation of disease-specific MRSα using samples from healthy subjects and patients with the disease in the validation cohort.

Figure 3A presents that AUC values and 95% confidence intervals for MRSαs to predict healthy and 4 different diseases in discovery and validation cohorts, respectively. Overall, MRSαs achieve great predictive performance in both discovery (AUCs: 0.60–0.88) and validation (AUCs: 0.68–0.86) cohorts. Notably three MRSαs (healthy vs. CA, healthy vs. CC, and healthy vs. RA) have higher AUCs in validation cohort, compared to discovery cohort. Among these four disease-specific MRSαs, MRSα specific for CD disease has the best predictive performance (AUC=0.88 in discovery and AUC=0.86 in validation). In addition, different MRSαs are constructed by different identified taxa. 5, 6, 12, and 6 taxa are used for constructions of MRSαs for CA, CC, CD, and RA, respectively (Fig. 3B; Table S3). Several taxa contribute multiple MRSαs, for example, species Bifidobacterium angulatum is involved for constructions of MRSαs for CA, CC, and RA (Table S3). On the other hand, 21 taxa are disease-specific and exclusively used in one MRSα (Table S3). They are differentially abundant in Healthy, CA, CC, CD, and RA samples (Tables S4 and S5). This demonstrates that the proposed MRS framework powerfully improves disease prediction by incorporating the disease-specific microbial profile. This feature makes the proposed MRS framework more crucial in practice, as most research studies aim to identify the microbial taxa specifically playing a role in a certain disease, rather than the generalized disease-associated microbial taxa.

Fig. 3
figure 3

Evaluation of MRS in the discovery and validation cohorts [27]. A The AUC values and 95% confidence intervals (CIs) for MRSαs to predict healthy and different disease conditions in discovery and validation cohorts, respectively. B Venn diagrams of taxa identified in pairwise comparisons of healthy versus CA, CC, CD, and RA. CA, colorectal adenoma; CC, colorectal cancer; CD, Crohn’s disease; and RA, rheumatoid arthritis

Similar to disease-specific MRS, we also assessed the MRS framework that distinguishes two disease groups, as well as healthy and nonhealthy conditions defined as in the original study [27] in the discovery and validation cohorts, respectively. Figure S2 presents the AUC values and 95% confidence intervals for MRSαs to classify any two diseases of CA, CC, CD, and RA, and healthy and nonhealthy conditions in discovery and validation cohorts, respectively. Table S3 correspondingly reports which taxa are involved for these MRSα calculations, respectively. Again, the MRS framework achieves notable performance. For example, discovery cohort has AUCs of 0.91 and 0.89, meanwhile, validation cohort has AUCs of 0.84 and 0.84, to distinguish CD from RA and CC, respectively. Validation cohort has a relatively lower AUC for classifying CA and RA, due to the small sample size. In terms of healthy vs. nonhealthy prediction, MRSα achieves consistently competitive performance but with much fewer species, whose AUCs are 0.7 and 0.71 in discovery and validation cohorts, respectively, compared to GMHI whose AUCs are 0.7 and 0.74 in discovery and validation cohorts, respectively. And the identified 6 species for MRSα construction is a subset of 50 microbial species used in GMHI [27].

Results of risk score-based multi-omics data integration

NYULH COVID-19 cohort

In addition to metagenome data, the NYULH COVID-19 cohort has metatranscriptome and host transcriptome data. In the following, we present how to integrate metagenomic, metatranscriptomic, and host transcriptomic datasets using the proposed MRSα and the evaluation of different methods. For the metatranscriptomic data, we employed the same MRS algorithm as we described in the Methods section, in terms of determining the p value cutoff, identifying candidate taxa, and constructing the microbial risk score, to construct its MRSα. In order to differentiate various MRSαs, we denoted the MRSα using the metagenomic and metatranscriptomic data by DNA_MRSα and RNA_MRSα, respectively, in the rest of manuscript. For the transcriptomic data, we employed DESeq2 [36] to evaluate the association effects of genes on the deceased/alive status, determined the p value cutoff based on the P+T method, and identified the candidate genes by AUC evaluation. Then, we defined the weighted sum of log-transformed counts of the selected candidate genes for each sample as the risk score (denoted as Host), with the weight being 1 if the corresponding logarithmic fold change estimate from DESeq2 was positive, otherwise −1. Computational details are reported in Section S1. Figure 4A shows that the risk scores based on metagenomic, metatranscriptomic, and host transcriptomic data separately have the AUC values of 0.74, 0.69, and 0.63, respectively, in terms of predicting deceased/alive status. Furthermore, the combinations of risk scores from different datasets can obviously improve the predictive performance (Fig. 4B) of mortality. The combinations of any two datasets have comparable AUC values and perform similarly. As expected, the integration of all three datasets (DNA_MRSα + RNA_MRSα + Host) has the highest AUC of 0.85, which yields at least a 15% increase in AUC compared to DNA_MRSα, RNA_MRSα, or Host alone. In Fig. 5, comparing the risk scores between the alive and deceased groups, the deceased group always has a significantly higher average risk score than the alive group, and no matter the score was constructed based on a single omics dataset or the integration of different omics datasets (p values<0.05).

Fig. 4
figure 4

The ROC curves and AUC values for the various risk scores to predict alive and deceased status in the NYULH COVID-19 cohort. A Predication performance for the individual risk scores constructed based on metagenome (DNA_MRSα), metatranscriptiome (RNA_MRSα), and host transcriptome (Host), separately. B Predication performance based on multiple risk scores using additive model

Fig. 5
figure 5

Box plots of the score comparisons between alive and deceased group. All risk scores are standardized among all samples, respectively. The statistical significance on a group comparison is evaluated by Wilcoxon signed-rank test

Figure 6 presents the 2D or 3D scatterplots of risk scores from metagenomics, metatranscriptomic, and host transcriptomic data. The subjects were first classified into “high risk” and “low risk” groups by each risk score’s mean. We next checked how well these risk classifications can be used to predict disease status by reporting the classification metrics [55]: sensitivity, specificity, accuracy, and F1 score in Table 1. Specifically, the predicted values for the subjects labeled as “high risk” by two risk scores (in Fig. 6A–C) or by all three risk scores (in Fig. 6D) datasets are “deceased,” and the predicted values for the subjects labeled as “low risk” by two risk scores (in Fig. 6A–C) or by all three risk scores (in Fig. 6D) datasets are “alive.” From Table 1, we can see that among the combinations of two risk scores for classification, the combination of metagenomic and host transcriptomic risk scores has the highest sensitivity, accuracy, and F1 score, but is still inferior to the combination of all three omics risk scores, which identifies the mortality status with 86% sensitivity, 91% specificity, 88% accuracy, and an F1 score of 0.89. In this real study, from different angles, including the AUC in Fig. 4, the scatterplots of risk scores in Fig. 6, and the test results in Table 1, we show that combining risk scores from metagenomics, metatranscriptomic, and host transcriptomic data increases the predictive accuracy for COVID-19 mortality.

Fig. 6
figure 6

Scatterplots of risk scores based on metagenome, metatranscriptome, and host transcriptome data. AC Scatterplots of DNA_MRSα vs RNA_MRSα, DNA_MRSα vs Host, and RNA_MRSα vs Host, respectively. Dotted line denotes the mean of the corresponding risk score across all subjects. D 3D scatterplot of DNA_MRSαvs RNA_MRSαvs Host

Table 1 Classification evaluation for subjects having extreme risk categories (labeled as either “high risk” or “low risk” by both or all three risk scores) in the NYULH COVID-19 cohort
Fig. 7
figure 7

Heatmaps of Spearman’s rank correlations between the top 50 taxa from metagenome and the top 50 genes from host transcriptome, in alive and deceased groups, separately. The top 50 features were selected based on the proportion of selection in all CV iterations

Table S6 reports the included features in the metagenomic, metatranscriptomic, and host transcriptomic risk scores separately. The feature importance was determined by the selection proportion among all CV iterations. For the host transcriptomic data, the fold change between deceased and alive was used to determine the feature importance when the selection proportions were the same. Here, we take the top 50 features in each data as an illustration to investigate the correlation networks among these three datasets. Figures 7, S3, and S4 show the paired correlation heatmaps among the selected metagenomic, metatranscriptomic, and host transcriptomic features in the alive and deceased groups, respectively. Notably, the alive and deceased groups have different correlation patterns among these top 50 features from any two datasets. Specifically, the metagenomics features tend to have stronger correlations with the host transcriptomic and metatranscriptomic features in the deceased group, compared to the alive group; and the metatranscriptomic features tend to have more negative correlations with the host transcriptome in the alive group.

Note that the results reported in this section are different from those in Sulaiman et al. [26] in which the main goal was to reveal the scientific findings and the Cox proportional-hazards model [51] was employed to identify the candidate taxa and genes associated with the time to death. In this paper, we formally introduce the MRS concept and propose it as a general method with the detailed instruction on how to construct MRS. As a validation of the proposed method, the results presented above based on the binary outcome (deceased vs. alive) agree with the previous scientific conclusions [26]. Table 2 reports the hazard ratios of all risk scores constructed in this paper and their combinations on the time to death based on the Cox proportional-hazards model. All risk scores are significantly associated with the time to death. As we found in [26], metatranscriptomic data alone, or combined with the other two datasets, always has a higher hazard of death, because it involves SARS-CoV-2 viral, which is a key risk factor on the COVID-19 mortality.

Table 2 Association results between the risk scores and the time to death based on the Cox proportional-hazards model in the NYULH COVID-19 cohort

Overall, these results highlight that the proposed community-based MRSα can characterize and summarize the microbial profiles effectively and provide a flexible way to integrate microbiome data with other omics data. Integrations of risk scores from different omics data further improves the predictive performance on the alive/deceased status in the NYULH COVID-19 study.

TEDDY study

Although the TEDDY cohort includes both genome and microbiome data, the previous microbiome research on TEDDY study [53, 54] focused exclusively on the microbiome profiles and only identified very few microbial signatures associated with T1D. Given the fact that T1D is a multifactorial disease caused by both genetic and environmental factors and the children enrolled in the TEDDY study, all have high-genetic risk for the T1D development (they have at least one of nine HLA DR-DQ genotypes associated with high risk for T1D) [29], we here propose a new angle to employ the proposed MRS along with the existing PRS for T1D to investigate the combined effect of microbial profile and host genetic profile on T1D risk prediction. Specifically, we analyzed 551 TEDDY subjects who have both microbiome data and genotype data; 75 of them developed T1D. Using the available genotype data and the PRS algorithm which has the robust and superior prediction performance on T1D [48, 49], we built the PRS for subjects. We used the microbial samples that were collected at the time point most close to month 30 when microbiome profile got stable and the largest sample size was available, to build MRSα to predict T1D status independently. The practice of MRS calculations are the same as those used in the NYULH COVID-19 study.

Figure 8A compares the AUCs for predicting T1D based on the individual risk scores and the combination of PRS and MRSα, and Fig. 8B–D shows the Kaplan–Meier survival curve comparisons between high- and low-risk group identified by PRS, MRSα, and PRS + MRSα, respectively. Specifically, subjects whose risk scores are above the third quartile are defined as high risk, others as low risk. Although the predictive models considered in Fig. 8A have only modest predictive ability in the TEDDY cohort (AUC range [0.58, 0.63]), we found that integrating PRS and MRSα scores is more useful in stratifying the subjects into high- and low-risk groups for the T1D development (Fig. 8D) than the PRS (Fig. 8B) or MRSα (Fig. 8C) alone, which indicates that the potential genetic-microbial interaction effect on the T1D progression. These results exhibit the utility of modeling multi-omics risk scores to identify the high-risk populations who can benefit from more targeted interventions.

Fig. 8
figure 8

Results for T1D prediction in the TEDDY study. A ROC curves and AUC values for predicting T1D status using various risk score. PRS_hla is constructed from the HLA alleles alone, and PRS is constructed from all SNPs found in the TEDDY cohort based on the existing PRS algorithm [49]. MRSα is the negative alpha diversity (Shannon index) calculated on the selected sub-community, which is selected by ANCOM-BC method and P+T method. BD. Kaplan–Meier plots for the groups of subjects at high and low risk of developing T1D, based on PRS, MRSα, and the combination of PRS and MRSα, respectively. Subjects whose risk scores are above the third quartile are defined as high risk, others as low risk

Discussion

With the recent proliferation of large-scale microbial association studies, we propose a two-step novel microbial risk score framework to aggregate the high-dimensional microbiome profile into a summarized risk score and apply it in disease prediction. Specifically, we first identify the associated taxa based on the recommended microbial association tests by two recent benchmarking works [32, 33] and P+T method, and then construct a community-based MRSα, because that the microbiome is a complex ecosystem composed of numerous sub-communities, and its influence on the disease development acts at the community instead of the single-microbe level and is disease-dependent. The application in the NYULH COVID-19 cohort demonstrates the superior performance of MRSα in the disease prediction, compared to the standard MRSS, which is constructed similarly as PRS, ML-based prediction algorithms, and six alpha diversity measures on the whole microbiome community. The evaluation of MRSα using the GMHI integrated dataset which consists of independent discovery and validation cohorts reveals the notable reproducibility of MRSα in terms of disease prediction.

Combining omics datasets that provide biological information from different layers is vital to comprehensively study phenotypes and accurately predict diseases. However, complex data structures, for example, high-dimensionality, sparsity, compositionality, interdependence, and hierarchical tree structures, all make multi-omics data integration challenging. In this paper, the proposed MRS provides a straightforward and flexible way to incorporate multi-omics datasets and explore the microbial interactions with other omics profiles. Integration of the proposed MRS and the risk scores constructed from other omics data increases the ability for disease prediction. Integrations of metagenomic with metatranscriptomic and host transcriptomic datasets from NYULH COVID-19 cohort underline the critical and insightful utility of the constructed risk scores for disease prediction and the promising ability of multi-omics data integration for predictive accuracy improvement. Additionally, the data from TEDDY study illuminates the potential in combining MRS and PRS to explore genetic-microbial interaction and identify the high-risk population.

Apart from the ANCOM-BC and Shannon index used in the proposed MRSα, there are other differential abundance methods available to identify the signature microbial taxa associated with disease and other alpha diversity indices to characterize the community diversity. Here, we investigate how does using two other differential abundance methods (ALDEx2 [56] and Maaslin2 [57] suggested by [32, 33]) and Simpson and observed alpha diversity indices to construct MRSα affect the predictive performance of the MRS framework in terms of AUC value and 95% CI in the discovery and validation cohorts [27] separately. Figure S5 shows that no single MRSα can uniformly perform best for all predictions in the discovery and validation cohorts, as various DA methods have different model assumptions and test hypotheses and various alpha diversities indices have different definitions, while links between microbiome profile with various healthy or disease conditions are different. Specifically, given an alpha diversity index in the second step, DA method has no effect on the prediction performance of MRSα in both discovery and validation cohorts (p value>0.05) using the Kruskal-Wallis test on the AUC, except for Simpson index in the discovery cohort (p value=0.03) (Figure S6). MRSαs constructed with ANCOM-BC, ALDEx2, and Maaslin2, which all have been well-recognized [32, 33] and have comparable performances. It supports our suggestion that to carry over the evaluation results of the DA tests from an objective benchmark work to guide the selection of DA test in the MRS framework. In terms of comparisons among Shannon, Simpson, and Observed indices, Observed index-based MRSα has the highest AUCs, followed by Shannon index, while Simpson index has the lowest AUCs, in the discovery cohort (Figure S7). On the other hand, Shannon index consistently has better or comparable AUCs in the validation cohort. Meanwhile, Observed and Simpson indices introduce more variation in the predictive performance of MRSα (Figure S7). Observed index lacks some reproducibility in the validation cohort, compared to its impressing performance in the discovery cohort, probably because it only accounts for species richness. Taken together, Shannon index based MRSα has relatively more robust and consistent prediction performance. With existing discussions [32, 33] and the observations above in this manuscript, we include various DA methods commonly used and recommended in the microbiome association studies and various alpha diversity indices in the MRS R package to let the proposed MRS framework informative and more practically valuable.

The findings of this study have some limitations. First, considering microbial profile varies across ethnicities as well as geographies [58,59,60], it is necessary to evaluate the portability of MRS between populations. More advanced methods will be required to reduce the bias due to ethnical or geographical differences. Second, the microbiome data have versatile characteristics and unique features, such as phylogenetic tree structure, functional structure, hierarchical taxonomy, and dynamic nature, which also play critical roles in analytical accuracy and efficiency [61, 62]. Incorporating such features may improve the accuracy of MRS. Third, derivation and validation of MRS require large-scale microbiome studies. However, the high cost of metagenomics sequencing restrict the comprehensive external validation. Despite the above challenges, this paper proposes a practicable way to summarize the microbial profiles and provides promising findings for comprehensive microbiome research to bolster the microbiome’s utility as a potential source of novel therapeutic features.

Conclusions

This paper sheds light on the utility of the microbiome data for disease prediction and multi-omics integration by converting the complex microbial profile into a continuous risk score. The proposed MRS tool provides a great potential in studying the complex microbial ecosystem, understanding the microbiome’s role in disease diagnosis and prognosis, and exploring microbiome’s full clinical potential.